Capstone Project¶

Notebook 2: Exploratory Data Analysis, Data Preprocessing, Modeling¶

  • Tia Plagata

  • https://github.com/tiaplagata/capstone-project

  • tiaplagata@gmail.com

  • Interactive Dash App

  • Blog Post on Creating Beautiful Word Clouds

Table of Contents

  • 1  Capstone Project
    • 1.1  Notebook 2: Exploratory Data Analysis, Data Preprocessing, Modeling
  • 2  Project Overview
      • 2.0.1  Methodology & Data Used
    • 2.1  Explore/clean the data
      • 2.1.1  Duplicates
      • 2.1.2  Class Imbalance
    • 2.2  Text Cleaning & Preprocessing & More Exploration
      • 2.2.1  Look at different vectorization strategies
    • 2.3  Removing Noise from the Data
      • 2.3.1  Make Nicer Word Clouds
      • 2.3.2  Most Frequent Words Visualizations
    • 2.4  Modeling
      • 2.4.1  Baseline Naive Bayes Model
      • 2.4.2  Naive Bayes Iteration 2
      • 2.4.3  Iteration 3: What happens if I take the city names out?
      • 2.4.4  Iteration 4: Try using Count Vectorization
      • 2.4.5  Iteration 5: Try using Bi-Grams
      • 2.4.6  Iteration 6: Try using a Random Forest Model
      • 2.4.7  Try out iteration 3 without lemmatization
    • 2.5  Test out the model
      • 2.5.1  Make this process into a pipeline
      • 2.5.2  Make Pipeline and Gridsearch for Random Forest
      • 2.5.3  Get top 2 predictions from best model
    • 2.6  Conclusion
      • 2.6.1  Model Fit & Score
      • 2.6.2  Business Recommendations
      • 2.6.3  Next Steps -- Dash App

Project Overview¶

The COVID-19 pandemic has severely affected the travel industry. International travel has been impacted, and in turn travel companies and travel websites have lost much of their engagement.

However, with the development of new vaccines for the virus, there is hope on the horizon for international travel and a time where life is somewhat back to normal. In order to increase engagement in the travel industry and increase excitement about travel opportunities, the Destination Dictionary was born!

The Destination Dictionary is a data product that allows future travelers to get a prediction for their perfect destination with the input of just a few words. Trained on over 28,000 unique text data points, the Destination Dictionary is able to predict a destination from 12 different popular cities with 81% accuracy based on text input of activities you want to do while on vacation.

Methodology & Data Used¶

This project utilized data from 12 top cities from TripAdvisor's list of Traveler's Choice destinations for Popular World Destinations 2020, which can be found via this link. The dataset was compiled by scraping the titles from Tripadvisor 'attractions' for each of the 12 cities. The final dataset included over 28,000 unique text values.

In [1]:
# Import Statements

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import string
import regex as re
import spacy

from nltk.corpus import stopwords
# nltk.download('stopwords')
# nltk.download('punkt')
from nltk import word_tokenize
from nltk import FreqDist

import warnings
warnings.filterwarnings('ignore')

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, accuracy_score, f1_score, confusion_matrix, classification_report
from sklearn.utils import class_weight
from sklearn.pipeline import Pipeline
from sklearn.base import TransformerMixin
from sklearn import set_config

from PIL import Image
from wordcloud import WordCloud
from textwrap import wrap

import joblib
In [2]:
# Read in the DataFrame I created in the Data Collection notebook 
df = pd.read_csv('..\Data\cities_df', index_col=0)
df.head()
df.describe()
Out[2]:
Attraction City
count 28379 28379
unique 27466 12
top Desert Safari Dubai Bali, Indonesia
freq 15 5000

Explore/clean the data¶

  • Decide whether or not to get rid of duplicates
  • Check out the class imbalance
  • No need to worry about null values because I scraped this dataset myself
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 28379 entries, 0 to 3693
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Attraction  28379 non-null  object
 1   City        28379 non-null  object
dtypes: object(2)
memory usage: 665.1+ KB
In [4]:
df.describe()
Out[4]:
Attraction City
count 28379 28379
unique 27466 12
top Desert Safari Dubai Bali, Indonesia
freq 15 5000
In [5]:
df.shape
Out[5]:
(28379, 2)
In [6]:
#No null values because I scraped everything myself. Just to double-check:
df.isna().sum()
Out[6]:
Attraction    0
City          0
dtype: int64

Duplicates¶

In [7]:
df.duplicated().sum()
Out[7]:
846
In [8]:
# Look at duplicates in one city
df[(df.duplicated()==True) & (df['City']=='London, United Kingdom')]
Out[8]:
Attraction City
185 Windsor Castle, Stonehenge, and Oxford Day Tri... London, United Kingdom
241 Jack the Ripper Walking Tour in London London, United Kingdom
259 British Museum Guided Tour London, United Kingdom
618 Private Transfer from Heathrow Airport to London London, United Kingdom
704 Oxford City Full-Day Private Tour from London London, United Kingdom
705 Bath and Stonehenge Full-Day Private Tour from... London, United Kingdom
706 Full-Day Private Guided Tour of Cambridge London, United Kingdom
716 Full-Day Private Tour of Brighton London, United Kingdom
900 PRIVATE Jack the Ripper Ghost Walking Tour in ... London, United Kingdom
990 London to Southampton Cruise Terminals Private... London, United Kingdom
1074 Liverpool the Beatles Legend Fab Four and Manc... London, United Kingdom
1365 Private Full-Day Tour of Shakespeare's Stratfo... London, United Kingdom
1367 Bournemouth and Durdle Door Jurassic Full Day ... London, United Kingdom
1369 Full-Day Private Tour to the Historic Naval Ci... London, United Kingdom
1468 Full-Day Private Fun Cultural Guided Tour of L... London, United Kingdom
1469 London Royal's Full Day Tour London, United Kingdom
1472 Changing of the Guard Half-Day Private Walking... London, United Kingdom
1473 3-Hour Guided Tour of Science Museum in London London, United Kingdom
1493 London's City Lights by Night Private Tour London, United Kingdom
1494 Theme Parks of London Chessington Full-Day Pri... London, United Kingdom
1502 J.R.R. Tolkien's Oxford and Stonehenge Private... London, United Kingdom
1504 Private Layover Tour from London City Airport London, United Kingdom
1516 London Full-Day Private Shore Excursion from S... London, United Kingdom
1517 2-Day Private Wales Tour to Cardiff and Aberfa... London, United Kingdom
1522 London Full Day Private Tour by Walking and Pu... London, United Kingdom
1536 Oxford City and Cotswolds Private Tour London, United Kingdom
1537 Salisbury Magna Carta Stonehenge and Bath Priv... London, United Kingdom
1542 The Golden Triangle Tour | London-Oxford-Cambr... London, United Kingdom
1606 London Skyline Tour London, United Kingdom
1612 Wimbledon Tennis and Museum Tour London, United Kingdom
1613 London Shopping Experience Tour London, United Kingdom
1622 Freestyle Football Workshop in England London, United Kingdom
1675 The Crown Netflix TV London Half Day Private Tour London, United Kingdom
1679 007 James Bond's London Private Half Day Tour London, United Kingdom
1732 4 Hour Tour Harry Potter Locations In London (... London, United Kingdom
1812 Private Chauffeured Minivan at Your Disposal i... London, United Kingdom
1846 Canterbury Cathedral and Leeds Castle Private ... London, United Kingdom
1881 Windsor Castle Heathrow Airport Private Layover London, United Kingdom
1882 Young Victoria's London: Windsor Castle & Kens... London, United Kingdom
1920 9Hr Tour London Eye, Westminster Abbey and St ... London, United Kingdom
1930 Essential London Full-Day Private Tour by Publ... London, United Kingdom
1952 Heathrow Airport Transfer London, United Kingdom
1961 Sherlock Holmes Walking Tour in London London, United Kingdom
1974 Royal London Walking Tour London, United Kingdom
2074 1066 Battle of Hastings, Birling Gap and Seven... London, United Kingdom
2120 Full Day Traditional Private London Tour by Wa... London, United Kingdom
2140 Zoom online tour of London London, United Kingdom
2230 London Underground 2-Hour Tube Tour London, United Kingdom
2282 London to Southampton Cruise Terminals Private... London, United Kingdom
2285 Departure Private Transfers from London City t... London, United Kingdom
2291 4 Hour Tour Tower of London and St Pauls Cathe... London, United Kingdom
2304 Warner Bros' Making of Harry Potter Studio Tour London, United Kingdom
2342 Arrival Private Transfers from London Railway ... London, United Kingdom
2344 Beautiful Cornwall Two Days Private Tour London, United Kingdom
2350 Jack the Ripper Mystery Walks London, United Kingdom
2562 4 Hour Tour London Highlights with Private To... London, United Kingdom
2618 The London Landmarks London, United Kingdom
2735 Afternoon tea bus tour in London London, United Kingdom
2772 Full Day London Pick & Mix Customized Tour London, United Kingdom
2773 A Day at the Museum - Natural History Museum L... London, United Kingdom
In [9]:
df[df['Attraction']=='Oxford City Full-Day Private Tour from London']
Out[9]:
Attraction City
431 Oxford City Full-Day Private Tour from London London, United Kingdom
704 Oxford City Full-Day Private Tour from London London, United Kingdom
In [10]:
# What about the top attraction?
df[df['Attraction']=='Desert Safari Dubai']
Out[10]:
Attraction City
414 Desert Safari Dubai Dubai, United Arab Emirates
478 Desert Safari Dubai Dubai, United Arab Emirates
811 Desert Safari Dubai Dubai, United Arab Emirates
974 Desert Safari Dubai Dubai, United Arab Emirates
998 Desert Safari Dubai Dubai, United Arab Emirates
1001 Desert Safari Dubai Dubai, United Arab Emirates
1689 Desert Safari Dubai Dubai, United Arab Emirates
1718 Desert Safari Dubai Dubai, United Arab Emirates
1722 Desert Safari Dubai Dubai, United Arab Emirates
1944 Desert Safari Dubai Dubai, United Arab Emirates
2301 Desert Safari Dubai Dubai, United Arab Emirates
2514 Desert Safari Dubai Dubai, United Arab Emirates
2854 Desert Safari Dubai Dubai, United Arab Emirates
3101 Desert Safari Dubai Dubai, United Arab Emirates
3426 Desert Safari Dubai Dubai, United Arab Emirates

Clearly, I need to remove duplicates here, because there are some exact duplicates for certain cities.

In [11]:
df = df.drop_duplicates()
In [12]:
# df.to_csv('../Data/cities_cleaned')

Class Imbalance¶

In [13]:
display(df.City.unique())
print('Total Unique Cities:', len(df.City.unique()))
array(['London, United Kingdom', 'Paris, France', 'Crete, Greece',
       'Bali, Indonesia', 'Rome, Italy', 'Phuket, Thailand',
       'Sicily, Italy', 'Majorca, Balearic Islands', 'Barcelona, Spain',
       'Istanbul, Turkey', 'Goa, India', 'Dubai, United Arab Emirates'],
      dtype=object)
Total Unique Cities: 12
In [14]:
df.City.value_counts(normalize=True)
Out[14]:
City
Bali, Indonesia                0.177823
Rome, Italy                    0.174990
Dubai, United Arab Emirates    0.123597
London, United Kingdom         0.099626
Paris, France                  0.096938
Istanbul, Turkey               0.081248
Sicily, Italy                  0.071986
Barcelona, Spain               0.066502
Phuket, Thailand               0.041260
Crete, Greece                  0.037010
Majorca, Balearic Islands      0.016489
Goa, India                     0.012530
Name: proportion, dtype: float64
In [15]:
cities = df.groupby('City').count()
In [16]:
cities.reset_index(inplace=True)
In [17]:
sorted_cities = cities.sort_values(by='Attraction', ascending=False)
sorted_cities
Out[17]:
City Attraction
0 Bali, Indonesia 4896
10 Rome, Italy 4818
3 Dubai, United Arab Emirates 3403
6 London, United Kingdom 2743
8 Paris, France 2669
5 Istanbul, Turkey 2237
11 Sicily, Italy 1982
1 Barcelona, Spain 1831
9 Phuket, Thailand 1136
2 Crete, Greece 1019
7 Majorca, Balearic Islands 454
4 Goa, India 345
In [18]:
# Plot the class imbalance
plt.figure(figsize=(10,8))
sns.barplot(x='Attraction', y='City', data=sorted_cities)
plt.title('Attractions Per City')
plt.xticks(rotation=90)
plt.show()
No description has been provided for this image

This will likely be an issue when modeling, so I will try to use class weights to fix this problem.

Text Cleaning & Preprocessing & More Exploration¶

  • Remove punctuation and numbers
  • Lowercase everything
  • Remove stopwords
  • Create a document term matrix grouped by city
    • count vectorization
    • tf-idf vectorization
    • bi-grams
  • Visualize most frequent words
    • word clouds
    • bar plot/histogram
In [19]:
# Create a list of stopwords
stopwords_list = stopwords.words('english')
stopwords_list += list(string.punctuation)
In [20]:
# Preview the list
stopwords_list[:5]
Out[20]:
['i', 'me', 'my', 'myself', 'we']
In [21]:
# Save stopwords list for app
joblib.dump(stopwords_list, '../Data/stopwords_list')
Out[21]:
['../Data/stopwords_list']
In [22]:
# Lowercase all words in each corpus
df['cleaned'] = df['Attraction'].apply(lambda x: x.lower())
df.head()
Out[22]:
Attraction City cleaned
0 SEA LIFE London Aquarium Admission Ticket London, United Kingdom sea life london aquarium admission ticket
1 The Jack The Ripper Walking Tour in London London, United Kingdom the jack the ripper walking tour in london
2 Ghost Bus Tour of London London, United Kingdom ghost bus tour of london
3 Big Bus London Hop-On Hop-Off Tour and River C... London, United Kingdom big bus london hop-on hop-off tour and river c...
4 The Blood and Tears Walk: Serial Killers and L... London, United Kingdom the blood and tears walk: serial killers and l...
In [23]:
# Remove commas, hyphens, colons, and other punctuation
df['cleaned'] = df['cleaned'].apply(lambda x: re.sub('[%s]' % re.escape(string.punctuation), '', x))
df.head()
Out[23]:
Attraction City cleaned
0 SEA LIFE London Aquarium Admission Ticket London, United Kingdom sea life london aquarium admission ticket
1 The Jack The Ripper Walking Tour in London London, United Kingdom the jack the ripper walking tour in london
2 Ghost Bus Tour of London London, United Kingdom ghost bus tour of london
3 Big Bus London Hop-On Hop-Off Tour and River C... London, United Kingdom big bus london hopon hopoff tour and river cru...
4 The Blood and Tears Walk: Serial Killers and L... London, United Kingdom the blood and tears walk serial killers and lo...
In [24]:
# Use regex to get rid of numbers 
df['cleaned'] = df['cleaned'].apply(lambda x: re.sub('\w*\d\w*','', x))
df.head(10)
Out[24]:
Attraction City cleaned
0 SEA LIFE London Aquarium Admission Ticket London, United Kingdom sea life london aquarium admission ticket
1 The Jack The Ripper Walking Tour in London London, United Kingdom the jack the ripper walking tour in london
2 Ghost Bus Tour of London London, United Kingdom ghost bus tour of london
3 Big Bus London Hop-On Hop-Off Tour and River C... London, United Kingdom big bus london hopon hopoff tour and river cru...
4 The Blood and Tears Walk: Serial Killers and L... London, United Kingdom the blood and tears walk serial killers and lo...
5 London Ghost and Infamous Murders Walking Tour London, United Kingdom london ghost and infamous murders walking tour
6 Stonehenge, Windsor Castle, and Bath from London London, United Kingdom stonehenge windsor castle and bath from london
7 Warner Bros. Studio: The Making of Harry Potte... London, United Kingdom warner bros studio the making of harry potter ...
8 Ghosts, Ghouls & Gallows: London Virtual Tour London, United Kingdom ghosts ghouls gallows london virtual tour
9 High-Speed Thames River RIB Cruise in London London, United Kingdom highspeed thames river rib cruise in london
In [25]:
# !python -m spacy download en
In [26]:
# Lemmatize the text using spacy
nlp = spacy.load('en_core_web_sm')

df['lemmatized'] = df['cleaned'].apply(lambda x: ' '.join(
                                    [token.lemma_ for token in list(nlp(x)) if (token.is_stop==False)]))
df.head(10)
Out[26]:
Attraction City cleaned lemmatized
0 SEA LIFE London Aquarium Admission Ticket London, United Kingdom sea life london aquarium admission ticket sea life london aquarium admission ticket
1 The Jack The Ripper Walking Tour in London London, United Kingdom the jack the ripper walking tour in london jack ripper walking tour london
2 Ghost Bus Tour of London London, United Kingdom ghost bus tour of london ghost bus tour london
3 Big Bus London Hop-On Hop-Off Tour and River C... London, United Kingdom big bus london hopon hopoff tour and river cru... big bus london hopon hopoff tour river cruise ...
4 The Blood and Tears Walk: Serial Killers and L... London, United Kingdom the blood and tears walk serial killers and lo... blood tear walk serial killer london horror
5 London Ghost and Infamous Murders Walking Tour London, United Kingdom london ghost and infamous murders walking tour london ghost infamous murder walk tour
6 Stonehenge, Windsor Castle, and Bath from London London, United Kingdom stonehenge windsor castle and bath from london stonehenge windsor castle bath london
7 Warner Bros. Studio: The Making of Harry Potte... London, United Kingdom warner bros studio the making of harry potter ... warner bros studio making harry potter luxury ...
8 Ghosts, Ghouls & Gallows: London Virtual Tour London, United Kingdom ghosts ghouls gallows london virtual tour ghosts ghoul gallow london virtual tour
9 High-Speed Thames River RIB Cruise in London London, United Kingdom highspeed thames river rib cruise in london highspeed thames river rib cruise london
In [27]:
# Group the corpora by city and join them
df_to_group = df[['City', 'lemmatized']]
df_grouped = df_to_group.groupby(by='City').agg(lambda x:' '.join(x))
df_grouped
Out[27]:
lemmatized
City
Bali, Indonesia hotel hotelbali private transfer daytime bali ...
Barcelona, Spain interactive spanish cooking experience barcelo...
Crete, Greece minoans world museum cinema crete wine ol...
Dubai, United Arab Emirates premium red dune camel safari bbq al khayma ...
Goa, India fontainhas heritage walk sunset cruise paradis...
Istanbul, Turkey bosphorus sunset cruise luxury yacht istanbu...
London, United Kingdom sea life london aquarium admission ticket jack...
Majorca, Balearic Islands cave genova admission palma de mallorca shore ...
Paris, France bateaux parisien seine river gourmet dinner ...
Phuket, Thailand phi phi maiton khai island speedboat phi phi...
Rome, Italy fast skiptheline vatican sistine chapel st pet...
Sicily, Italy etna taormina fullday tour catania palermo str...
In [28]:
# Save grouped df
df_grouped.to_csv('../Data/df_grouped')

Look at different vectorization strategies¶

  • Try different vectorization strategies and visualize them with word clouds
    • count vectorization
    • tf-idf vectorization
    • bi-grams
In [30]:
# Create a document term matrix using count vectorization
# Using count vectorization (most simple way to vectorize)
cv = CountVectorizer(analyzer='word', stop_words=stopwords_list)
data = cv.fit_transform(df_grouped['lemmatized'])
df_dtm = pd.DataFrame(data.toarray(), columns=cv.get_feature_names_out())
df_dtm.index = df_grouped.index
df_dtm
Out[30]:
aal abandon abant abba abbate abbey abbeyprivate abbeyst aberfan abian ... تشيتا خصوصي دي روما فورميا كاستيلوا مدينة من ميرتيتوا نقل
City
Bali, Indonesia 0 2 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
Barcelona, Spain 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Crete, Greece 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Dubai, United Arab Emirates 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Goa, India 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Istanbul, Turkey 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
London, United Kingdom 0 0 0 1 0 61 1 2 1 0 ... 0 0 0 0 0 0 0 0 0 0
Majorca, Balearic Islands 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Paris, France 0 0 0 0 0 5 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Phuket, Thailand 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Rome, Italy 0 0 0 0 0 3 0 0 0 0 ... 1 3 1 3 1 1 6 3 1 3
Sicily, Italy 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

12 rows × 8506 columns

In [31]:
# Create a document term matrix using TF-IDF vectorization
# Might be good for classifying cities
tfidf = TfidfVectorizer(analyzer='word', stop_words=stopwords_list)
data2 = tfidf.fit_transform(df_grouped['lemmatized'])
df_dtm2 = pd.DataFrame(data2.toarray(), columns=tfidf.get_feature_names_out())
df_dtm2.index = df_grouped.index
df_dtm2
Out[31]:
aal abandon abant abba abbate abbey abbeyprivate abbeyst aberfan abian ... تشيتا خصوصي دي روما فورميا كاستيلوا مدينة من ميرتيتوا نقل
City
Bali, Indonesia 0.000000 0.000699 0.000000 0.0000 0.000000 0.000000 0.0000 0.000000 0.0000 0.000407 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Barcelona, Spain 0.000737 0.000000 0.000000 0.0000 0.000000 0.000000 0.0000 0.000000 0.0000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Crete, Greece 0.000000 0.000000 0.000000 0.0000 0.000000 0.000000 0.0000 0.000000 0.0000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Dubai, United Arab Emirates 0.000000 0.000309 0.000000 0.0000 0.000000 0.000000 0.0000 0.000000 0.0000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Goa, India 0.000000 0.000000 0.000000 0.0000 0.000000 0.000000 0.0000 0.000000 0.0000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Istanbul, Turkey 0.000000 0.000000 0.000589 0.0000 0.000000 0.000000 0.0000 0.000000 0.0000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
London, United Kingdom 0.000000 0.000000 0.000000 0.0005 0.000000 0.023151 0.0005 0.001001 0.0005 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Majorca, Balearic Islands 0.000000 0.000000 0.000000 0.0000 0.003344 0.000000 0.0000 0.000000 0.0000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Paris, France 0.000000 0.000000 0.000000 0.0000 0.000000 0.002363 0.0000 0.000000 0.0000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Phuket, Thailand 0.000000 0.000000 0.000000 0.0000 0.000000 0.000000 0.0000 0.000000 0.0000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Rome, Italy 0.000000 0.000000 0.000000 0.0000 0.000000 0.000944 0.0000 0.000000 0.0000 0.000000 ... 0.000415 0.001244 0.000415 0.001244 0.000415 0.000415 0.002488 0.001244 0.000415 0.001244
Sicily, Italy 0.000000 0.000000 0.000000 0.0000 0.000000 0.000000 0.0000 0.000000 0.0000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

12 rows × 8506 columns

Word Clouds with Count Vectorization¶

In [32]:
def generate_wordcloud(data, title):
    cloud = WordCloud(width=400, height=330, max_words=150, colormap='tab20c').generate_from_frequencies(data)
    plt.figure(figsize=(10,8))
    plt.imshow(cloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('\n'.join(wrap(title,60)), fontsize=13)
    plt.show()
In [33]:
# Transposing document term matrix
df_dtm = df_dtm.transpose()

# Plotting word cloud for each city
for index, city in enumerate(df_dtm.columns):
    generate_wordcloud(df_dtm[city].sort_values(ascending=False), city)
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [35]:
# Look at top words with count vectorizer (in total, not per city)
sum_words = data.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in cv.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
words_freq
Out[35]:
[('tour', 13527),
 ('private', 8730),
 ('transfer', 3613),
 ('day', 3533),
 ('airport', 2945),
 ('rome', 2814),
 ('bali', 2359),
 ('dubai', 2331),
 ('city', 2130),
 ('london', 2075),
 ('paris', 1871),
 ('guide', 1602),
 ('istanbul', 1473),
 ('barcelona', 1244),
 ('trip', 1153),
 ('safari', 1056),
 ('dinner', 992),
 ('desert', 967),
 ('ubud', 926),
 ('cruise', 923),
 ('walk', 921),
 ('experience', 884),
 ('ticket', 847),
 ('museum', 817),
 ('hotel', 796),
 ('lunch', 788),
 ('group', 730),
 ('good', 715),
 ('local', 713),
 ('colosseum', 698),
 ('vatican', 696),
 ('phuket', 644),
 ('class', 642),
 ('food', 638),
 ('fullday', 635),
 ('car', 630),
 ('island', 624),
 ('ride', 624),
 ('temple', 619),
 ('bike', 589),
 ('wine', 587),
 ('small', 582),
 ('night', 582),
 ('include', 579),
 ('taste', 576),
 ('line', 554),
 ('hour', 545),
 ('luxury', 536),
 ('skip', 528),
 ('adventure', 495),
 ('half', 478),
 ('palermo', 473),
 ('waterfall', 446),
 ('abu', 434),
 ('sightseeing', 431),
 ('dhabi', 431),
 ('highlight', 423),
 ('port', 423),
 ('excursion', 421),
 ('boat', 419),
 ('sunset', 418),
 ('visit', 397),
 ('water', 393),
 ('roman', 392),
 ('beach', 376),
 ('palace', 375),
 ('driver', 373),
 ('chapel', 370),
 ('arrival', 369),
 ('dune', 368),
 ('package', 366),
 ('morning', 359),
 ('sistine', 356),
 ('camel', 355),
 ('bbq', 355),
 ('quad', 354),
 ('park', 341),
 ('phi', 334),
 ('taormina', 334),
 ('skiptheline', 332),
 ('etna', 331),
 ('departure', 329),
 ('cooking', 327),
 ('catania', 326),
 ('sunrise', 325),
 ('vip', 321),
 ('river', 316),
 ('walking', 314),
 ('batur', 311),
 ('cappadocia', 308),
 ('de', 307),
 ('village', 301),
 ('evening', 299),
 ('smallgroup', 299),
 ('forum', 293),
 ('raft', 291),
 ('family', 291),
 ('civitavecchia', 288),
 ('home', 279),
 ('halfday', 277),
 ('explore', 275),
 ('lot', 272),
 ('nusa', 271),
 ('atv', 267),
 ('trek', 265),
 ('ancient', 265),
 ('castle', 265),
 ('tanah', 260),
 ('art', 260),
 ('burj', 260),
 ('shore', 257),
 ('st', 257),
 ('market', 253),
 ('old', 253),
 ('heathrow', 253),
 ('mount', 251),
 ('center', 249),
 ('goa', 248),
 ('versa', 245),
 ('street', 243),
 ('swing', 240),
 ('rice', 237),
 ('vice', 236),
 ('exclusive', 233),
 ('crete', 231),
 ('tower', 230),
 ('volcano', 229),
 ('amazing', 229),
 ('turkey', 224),
 ('mallorca', 224),
 ('discover', 223),
 ('chania', 219),
 ('service', 216),
 ('uluwatu', 214),
 ('penida', 213),
 ('red', 212),
 ('ephesus', 212),
 ('access', 210),
 ('louvre', 209),
 ('hill', 207),
 ('pickup', 204),
 ('fiumicino', 204),
 ('entrance', 199),
 ('terrace', 198),
 ('bus', 196),
 ('bosphorus', 190),
 ('van', 189),
 ('photo', 185),
 ('heraklion', 185),
 ('khalifa', 185),
 ('pick', 182),
 ('pasta', 182),
 ('premium', 179),
 ('traditional', 177),
 ('kintamani', 177),
 ('dance', 176),
 ('live', 176),
 ('garden', 176),
 ('kid', 176),
 ('gate', 175),
 ('central', 175),
 ('east', 173),
 ('blue', 171),
 ('tiramisu', 171),
 ('minivan', 167),
 ('world', 166),
 ('photographer', 165),
 ('pamukkale', 164),
 ('snorkeling', 163),
 ('al', 163),
 ('cesarinas', 163),
 ('palatine', 162),
 ('charter', 160),
 ('windsor', 160),
 ('hot', 157),
 ('cdg', 156),
 ('royal', 155),
 ('secret', 154),
 ('share', 153),
 ('marina', 152),
 ('palma', 152),
 ('villa', 149),
 ('tasting', 149),
 ('white', 148),
 ('jungle', 148),
 ('free', 148),
 ('way', 148),
 ('history', 148),
 ('mosque', 148),
 ('area', 147),
 ('vehicle', 146),
 ('basilica', 146),
 ('combo', 145),
 ('dhow', 145),
 ('shopping', 142),
 ('professional', 142),
 ('heaven', 141),
 ('spa', 141),
 ('rental', 140),
 ('sagrada', 140),
 ('sea', 139),
 ('inclusive', 138),
 ('english', 138),
 ('afternoon', 137),
 ('pizza', 137),
 ('stonehenge', 137),
 ('sand', 136),
 ('familia', 136),
 ('taxi', 135),
 ('photography', 134),
 ('transport', 132),
 ('heritage', 132),
 ('fast', 131),
 ('bay', 130),
 ('trekking', 129),
 ('people', 129),
 ('sicily', 129),
 ('montserrat', 127),
 ('workshop', 126),
 ('enjoy', 125),
 ('culture', 124),
 ('admission', 123),
 ('classic', 123),
 ('roundtrip', 123),
 ('boarding', 123),
 ('bath', 122),
 ('pompeii', 122),
 ('coast', 121),
 ('segway', 120),
 ('business', 120),
 ('eiffel', 120),
 ('versaille', 120),
 ('speedboat', 119),
 ('cook', 118),
 ('montmartre', 118),
 ('instagram', 117),
 ('entry', 117),
 ('quarter', 117),
 ('balloon', 116),
 ('fun', 115),
 ('audio', 115),
 ('forest', 114),
 ('minute', 114),
 ('virtual', 114),
 ('town', 114),
 ('massage', 113),
 ('grand', 113),
 ('breakfast', 112),
 ('champagne', 112),
 ('speed', 111),
 ('drive', 110),
 ('balinese', 110),
 ('view', 110),
 ('cycling', 109),
 ('option', 109),
 ('james', 109),
 ('lempuyang', 108),
 ('underground', 108),
 ('messina', 107),
 ('la', 106),
 ('bash', 106),
 ('bond', 106),
 ('ayung', 104),
 ('new', 104),
 ('expert', 104),
 ('train', 104),
 ('ciampino', 104),
 ('ski', 103),
 ('tapas', 103),
 ('spring', 102),
 ('south', 102),
 ('north', 101),
 ('sailing', 101),
 ('romantic', 100),
 ('round', 99),
 ('course', 99),
 ('flight', 99),
 ('optional', 99),
 ('chauffeur', 98),
 ('ebike', 98),
 ('sitia', 98),
 ('west', 97),
 ('buggy', 97),
 ('yacht', 97),
 ('valley', 97),
 ('fountain', 97),
 ('catacomb', 97),
 ('lagoon', 96),
 ('monkey', 96),
 ('buffet', 96),
 ('electric', 96),
 ('early', 96),
 ('charles', 96),
 ('seine', 96),
 ('ferrari', 95),
 ('natural', 94),
 ('hide', 94),
 ('max', 94),
 ('peters', 94),
 ('wifi', 93),
 ('historical', 93),
 ('cultural', 93),
 ('session', 93),
 ('semiprivate', 93),
 ('fco', 93),
 ('diving', 92),
 ('rafting', 92),
 ('travel', 92),
 ('french', 92),
 ('westminster', 92),
 ('escape', 91),
 ('jewish', 91),
 ('southampton', 91),
 ('florence', 91),
 ('jsh', 90),
 ('countryside', 89),
 ('open', 88),
 ('lake', 88),
 ('pass', 87),
 ('special', 86),
 ('photoshoot', 86),
 ('di', 86),
 ('make', 85),
 ('floor', 85),
 ('orly', 85),
 ('syracuse', 85),
 ('jeep', 83),
 ('mountain', 83),
 ('express', 83),
 ('winery', 83),
 ('oxford', 83),
 ('nature', 82),
 ('black', 82),
 ('shuttle', 82),
 ('gallery', 82),
 ('activity', 81),
 ('house', 81),
 ('tivoli', 81),
 ('fishing', 80),
 ('beautiful', 80),
 ('gaulle', 80),
 ('amalfi', 80),
 ('agrigento', 80),
 ('coffee', 79),
 ('cave', 79),
 ('track', 79),
 ('show', 79),
 ('turkish', 79),
 ('spot', 78),
 ('jatiluwih', 78),
 ('cefalù', 78),
 ('cta', 78),
 ('semi', 77),
 ('time', 77),
 ('harry', 77),
 ('authentic', 76),
 ('camp', 76),
 ('site', 76),
 ('hidden', 75),
 ('hike', 75),
 ('gatwick', 75),
 ('air', 74),
 ('kecak', 74),
 ('big', 74),
 ('tofrom', 74),
 ('jet', 73),
 ('transportation', 73),
 ('sight', 73),
 ('arena', 73),
 ('sicilian', 73),
 ('modern', 72),
 ('pantheon', 72),
 ('snorkel', 71),
 ('fromto', 71),
 ('plane', 71),
 ('scuba', 70),
 ('game', 70),
 ('historic', 70),
 ('christmas', 70),
 ('layover', 70),
 ('dxb', 70),
 ('potter', 70),
 ('mt', 69),
 ('land', 69),
 ('gourmet', 69),
 ('abbey', 69),
 ('discovery', 68),
 ('gorge', 68),
 ('italian', 68),
 ('pmo', 68),
 ('daily', 67),
 ('canoe', 67),
 ('district', 67),
 ('le', 67),
 ('sabiha', 67),
 ('san', 66),
 ('cathedral', 66),
 ('rethymno', 66),
 ('sophia', 66),
 ('disneyland', 66),
 ('hire', 65),
 ('fire', 65),
 ('light', 65),
 ('treasure', 65),
 ('centre', 65),
 ('plus', 65),
 ('gothic', 65),
 ('nga', 65),
 ('hunt', 64),
 ('vacation', 64),
 ('room', 64),
 ('station', 64),
 ('hagia', 64),
 ('phang', 64),
 ('lesson', 63),
 ('customize', 63),
 ('international', 63),
 ('costa', 63),
 ('british', 63),
 ('unique', 62),
 ('friendly', 62),
 ('accommodation', 62),
 ('brava', 62),
 ('knossos', 62),
 ('square', 62),
 ('khai', 62),
 ('noto', 62),
 ('dps', 61),
 ('horse', 61),
 ('vintage', 61),
 ('heart', 61),
 ('creta', 61),
 ('trastevere', 61),
 ('allinclusive', 60),
 ('gaudi', 60),
 ('arab', 60),
 ('creek', 60),
 ('belly', 60),
 ('monreale', 60),
 ('club', 59),
 ('ultimate', 59),
 ('love', 59),
 ('thai', 59),
 ('pmi', 59),
 ('kuta', 58),
 ('chef', 58),
 ('panoramic', 58),
 ('min', 58),
 ('medieval', 58),
 ('giverny', 58),
 ('executive', 57),
 ('luton', 57),
 ('saint', 57),
 ('tuscany', 57),
 ('sport', 56),
 ('jimbaran', 55),
 ('pax', 55),
 ('meal', 55),
 ('rock', 55),
 ('hopon', 55),
 ('catamaran', 55),
 ('sandboarde', 55),
 ('zaye', 55),
 ('church', 55),
 ('normandy', 55),
 ('castel', 55),
 ('fly', 54),
 ('rai', 54),
 ('bedugul', 54),
 ('attraction', 54),
 ('drop', 54),
 ('scooter', 54),
 ('dining', 54),
 ('drink', 54),
 ('coral', 54),
 ('cheese', 54),
 ('hopoff', 54),
 ('chq', 54),
 ('loire', 54),
 ('telaga', 53),
 ('chocolate', 53),
 ('golf', 53),
 ('shoot', 53),
 ('customer', 53),
 ('hop', 53),
 ('italy', 53),
 ('gokcen', 53),
 ('ngurah', 52),
 ('plantation', 52),
 ('denpasar', 52),
 ('gem', 52),
 ('place', 52),
 ('terminal', 52),
 ('trevi', 52),
 ('orvieto', 52),
 ('dua', 51),
 ('waja', 51),
 ('tirta', 51),
 ('overnight', 51),
 ('star', 51),
 ('hiking', 50),
 ('besakih', 50),
 ('national', 50),
 ('real', 50),
 ('atlantis', 50),
 ('sheikh', 50),
 ('unesco', 49),
 ('cepung', 49),
 ('iconic', 49),
 ('aquarium', 49),
 ('official', 49),
 ('cretan', 49),
 ('frame', 49),
 ('latin', 49),
 ('topkapi', 49),
 ('cabaret', 49),
 ('oil', 48),
 ('journey', 48),
 ('summit', 48),
 ('girona', 48),
 ('cab', 48),
 ('dover', 48),
 ('borghese', 48),
 ('sorrento', 48),
 ('great', 47),
 ('location', 47),
 ('cotswold', 47),
 ('lhr', 47),
 ('piazza', 47),
 ('tukad', 46),
 ('self', 46),
 ('dive', 46),
 ('return', 46),
 ('pub', 46),
 ('spanish', 46),
 ('bcn', 46),
 ('independent', 46),
 ('antalya', 46),
 ('baroque', 46),
 ('like', 45),
 ('vespa', 45),
 ('peter', 45),
 ('deste', 45),
 ('ostia', 45),
 ('godfather', 45),
 ('seminyak', 44),
 ('custom', 44),
 ('magic', 44),
 ('bazaar', 44),
 ('ory', 44),
 ('ghetto', 44),
 ('siracusa', 44),
 ('point', 43),
 ('kayak', 43),
 ('troy', 43),
 ('trapani', 43),
 ('rent', 42),
 ('sedan', 42),
 ('selfguide', 42),
 ('miracle', 42),
 ('vineyard', 42),
 ('appian', 42),
 ('restaurant', 41),
 ('riding', 41),
 ('offer', 41),
 ('withlocal', 41),
 ('tea', 41),
 ('high', 41),
 ('passenger', 41),
 ('personalize', 41),
 ('olive', 41),
 ('ist', 41),
 ('churchill', 41),
 ('koh', 41),
 ('castelmola', 41),
 ('personal', 40),
 ('surf', 40),
 ('wonderful', 40),
 ('wonder', 40),
 ('holiday', 40),
 ('online', 40),
 ('stop', 40),
 ('sitge', 40),
 ('viceversa', 40),
 ('wild', 40),
 ('gelato', 40),
 ('gallipoli', 40),
 ('war', 40),
 ('ortigia', 40),
 ('lembongan', 39),
 ('vw', 39),
 ('canyon', 39),
 ('magical', 39),
 ('ghost', 39),
 ('see', 39),
 ('mercede', 39),
 ('entertainment', 39),
 ('canal', 39),
 ('bursa', 39),
 ('kidfriendly', 39),
 ('parisian', 39),
 ('orsay', 39),
 ('modica', 39),
 ('erice', 39),
 ('minibus', 38),
 ('yoga', 38),
 ('mini', 38),
 ('horseback', 38),
 ('wadi', 38),
 ('sharjah', 38),
 ('islands', 38),
 ('ragusa', 38),
 ('famous', 37),
 ('trail', 37),
 ('learn', 37),
 ('el', 37),
 ('golden', 37),
 ('ottoman', 37),
 ('court', 37),
 ('assisi', 37),
 ('dolphin', 36),
 ('speaking', 36),
 ('bird', 36),
 ('eye', 36),
 ('basis', 36),
 ('postcode', 36),
 ('dday', 36),
 ('marais', 36),
 ('racha', 36),
 ('positano', 36),
 ('navona', 36),
 ('alcantara', 36),
 ('safe', 35),
 ('incl', 35),
 ('rover', 35),
 ('del', 35),
 ('casa', 35),
 ('outlet', 35),
 ('archaeological', 35),
 ('krabi', 35),
 ('roma', 35),
 ('papal', 35),
 ('naples', 35),
 ('beginner', 34),
 ('holy', 34),
 ('honeymoon', 34),
 ('treatment', 34),
 ('beauty', 34),
 ('danu', 34),
 ('complete', 34),
 ('spice', 34),
 ('western', 34),
 ('helicopter', 34),
 ('monastery', 34),
 ('region', 34),
 ('global', 34),
 ('mediterranean', 34),
 ('audioguide', 34),
 ('path', 34),
 ('music', 34),
 ('bridge', 34),
 ('naxos', 34),
 ('person', 33),
 ('exploration', 33),
 ('seafood', 33),
 ('shop', 33),
 ('palm', 33),
 ('host', 33),
 ('platform', 33),
 ('hampton', 33),
 ('pastry', 33),
 ('simon', 33),
 ('gladiator', 33),
 ('ulun', 32),
 ('mystery', 32),
 ('meet', 32),
 ('price', 32),
 ('daytrip', 32),
 ('priority', 32),
 ('cava', 32),
 ('welcome', 32),
 ('story', 32),
 ('mall', 32),
 ('emirate', 32),
 ('jack', 32),
 ('siena', 32),
 ('cia', 32),
 ('scenic', 31),
 ('sekumpul', 31),
 ('sanur', 31),
 ('life', 31),
 ('bamboo', 31),
 ('zipline', 31),
 ('lose', 31),
 ('hrs', 31),
 ('hr', 31),
 ('landmark', 31),
 ('musée', 31),
 ('tegalalang', 30),
 ('field', 30),
 ('canggu', 30),
 ('crater', 30),
 ('cocktail', 30),
 ('santa', 30),
 ('footstep', 30),
 ('christian', 30),
 ('venice', 30),
 ('pauls', 30),
 ('da', 30),
 ('empul', 29),
 ('green', 29),
 ('destination', 29),
 ('seat', 29),
 ('studio', 29),
 ('foodie', 29),
 ('bicycle', 29),
 ('step', 29),
 ('france', 29),
 ('samaria', 29),
 ('accessible', 29),
 ('zoo', 29),
 ('immersive', 29),
 ('guard', 29),
 ('tegenungan', 28),
 ('route', 28),
 ('body', 28),
 ('well', 28),
 ('relax', 28),
 ('farm', 28),
 ('organic', 28),
 ('gangga', 28),
 ('country', 28),
 ('picnic', 28),
 ('eat', 28),
 ('prince', 28),
 ('gold', 28),
 ('guell', 28),
 ('license', 28),
 ('cruiser', 28),
 ('chamber', 28),
 ('warner', 28),
 ('pier', 28),
 ('museums', 28),
 ('mont', 28),
 ('dorsay', 28),
 ('segesta', 28),
 ('giardini', 28),
 ('tibumana', 27),
 ('padi', 27),
 ('combination', 27),
 ('taman', 27),
 ('diver', 27),
 ('super', 27),
 ('tubing', 27),
 ('road', 27),
 ('beer', 27),
 ('climb', 27),
 ('end', 27),
 ('flamenco', 27),
 ('europe', 27),
 ('rooftop', 27),
 ('sant', 27),
 ('mustsee', 27),
 ('theme', 27),
 ('agia', 27),
 ('arabian', 27),
 ('bros', 27),
 ('stopover', 27),
 ('byzantine', 27),
 ('imperial', 27),
 ('movie', 27),
 ('cart', 27),
 ('frascati', 27),
 ('fish', 26),
 ('board', 26),
 ('sky', 26),
 ('deep', 26),
 ('ruin', 26),
 ('gaudí', 26),
 ('güell', 26),
 ('stadium', 26),
 ('img', 26),
 ('disposal', 26),
 ('asian', 26),
 ('thame', 26),
 ('greenwich', 26),
 ('soho', 26),
 ('hanuman', 26),
 ('raya', 26),
 ('gili', 25),
 ('change', 25),
 ('cliff', 25),
 ('itinerary', 25),
 ('lovina', 25),
 ('eastern', 25),
 ('eco', 25),
 ('couple', 25),
 ('dropoff', 25),
 ('crawl', 25),
 ('culinary', 25),
 ('original', 25),
 ('andorra', 25),
 ('legend', 25),
 ('architecture', 25),
 ('crypt', 25),
 ('legoland', 25),
 ('anzac', 25),
 ('kensington', 25),
 ('cambridge', 25),
 ('lcy', 25),
 ('monet', 25),
 ('fashion', 25),
 ('moulin', 25),
 ('rouge', 25),
 ('pisa', 25),
 ('antica', 25),
 ('marsala', 25),
 ('capo', 25),
 ('barong', 24),
 ('underwater', 24),
 ('standard', 24),
 ('ijen', 24),
 ('true', 24),
 ('daytour', 24),
 ('resort', 24),
 ('level', 24),
 ('start', 24),
 ('reality', 24),
 ('party', 24),
 ('cable', 24),
 ('mosaic', 24),
 ('beat', 24),
 ('cellar', 24),
 ('rail', 24),
 ('opera', 24),
 ('gardens', 24),
 ('uae', 24),
 ('reserve', 24),
 ('continent', 24),
 ('dolmabahce', 24),
 ('michel', 24),
 ('maiton', 24),
 ('campo', 24),
 ('terracina', 24),
 ('banana', 23),
 ('tanjung', 23),
 ('certify', 23),
 ('healing', 23),
 ('handara', 23),
 ('unforgettable', 23),
 ('sacred', 23),
 ('direct', 23),
 ('bar', 23),
 ('main', 23),
 ('instagramable', 23),
 ('easy', 23),
 ('stay', 23),
 ('wildlife', 23),
 ('lover', 23),
 ('kickstart', 23),
 ('sail', 23),
 ('royalty', 23),
 ('relic', 23),
 ('picasso', 23),
 ('balos', 23),
 ('selfdrive', 23),
 ('bodrum', 23),
 ('battlefield', 23),
 ('beatle', 23),
 ('tate', 23),
 ('shakespeare', 23),
 ('khao', 23),
 ('gimignano', 23),
 ('civita', 23),
 ('paradise', 22),
 ('float', 22),
 ('pool', 22),
 ('mother', 22),
 ('maya', 22),
 ('escooter', 22),
 ('tree', 22),
 ('salt', 22),
 ('sanctuary', 22),
 ('style', 22),
 ('paella', 22),
 ('montjuic', 22),
 ('voicemap', 22),
 ('agio', 22),
 ('nikolaos', 22),
 ('elafonisi', 22),
 ('greek', 22),
 ('ripper', 22),
 ('buckingham', 22),
 ('notre', 22),
 ('fiori', 22),
 ('santangelo', 22),
 ('stpeter', 22),
 ('pope', 22),
 ('bagnoregio', 22),
 ('cesarina', 22),
 ('armerina', 22),
 ('individual', 21),
 ('tulamben', 21),
 ('watch', 21),
 ('camping', 21),
 ('extreme', 21),
 ('southern', 21),
 ('try', 21),
 ('john', 21),
 ('vr', 21),
 ('historian', 21),
 ('santorini', 21),
 ('making', 21),
 ('sapanca', 21),
 ('lgw', 21),
 ('stanste', 21),
 ('cala', 21),
 ('marai', 21),
 ('roissy', 21),
 ('dame', 21),
 ('aperitivo', 21),
 ('pompei', 21),
 ('vulcano', 20),
 ('banyumala', 20),
 ('virgin', 20),
 ('volkswagen', 20),
 ('melangit', 20),
 ('ayun', 20),
 ('saver', 20),
 ('tourist', 20),
 ('magnificent', 20),
 ('butterfly', 20),
 ('capital', 20),
 ('delivery', 20),
 ('long', 20),
 ('week', 20),
 ('flavor', 20),
 ('elounda', 20),
 ('aquaventure', 20),
 ('essential', 20),
 ('horn', 20),
 ('vito', 20),
 ('notte', 20),
 ('salisbury', 20),
 ('fontainebleau', 20),
 ('hong', 20),
 ('paraglide', 19),
 ('popular', 19),
 ('child', 19),
 ('create', 19),
 ('driving', 19),
 ('king', 19),
 ('coach', 19),
 ('unlimited', 19),
 ('near', 19),
 ('outdoor', 19),
 ('pack', 19),
 ('skyline', 19),
 ('ultra', 19),
 ('gramvousa', 19),
 ('jetski', 19),
 ('stanbul', 19),
 ('kusadasi', 19),
 ('camden', 19),
 ('downton', 19),
 ('blenheim', 19),
 ('ltn', 19),
 ('chantilly', 19),
 ('landing', 19),
 ('raphael', 19),
 ('spiritual', 18),
 ('walker', 18),
 ('combine', 18),
 ('incredible', 18),
 ('menjangan', 18),
 ('charm', 18),
 ('artisan', 18),
 ('school', 18),
 ('spectacular', 18),
 ('padang', 18),
 ('paddle', 18),
 ('break', 18),
 ('explorer', 18),
 ('introduction', 18),
 ('nou', 18),
 ('bear', 18),
 ('aperitif', 18),
 ('prat', 18),
 ('gastronomy', 18),
 ('year', 18),
 ('lo', 18),
 ('public', 18),
 ('typical', 18),
 ('jumeirah', 18),
 ('deluxe', 18),
 ('bollywood', 18),
 ('escort', 18),
 ('istanbuls', 18),
 ('tomb', 18),
 ('film', 18),
 ...]

There seems to be a lot of repetition with words like 'tour', 'private', 'transfer' between cities because all of these cities will have airport transfers, and different tours. However, these words will not help in modeling since they are not unique to the cities.

Word Clouds with TF-IDF Vectorization¶

In [36]:
# Transposing document term matrix
df_dtm2 = df_dtm2.transpose()

# Plotting word cloud for each city
for index, city in enumerate(df_dtm2.columns):
    generate_wordcloud(df_dtm2[city].sort_values(ascending=False), city)
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [37]:
# Look at top words with tf-idf vectorization (for total words, not per city)
sum_words = data2.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in tfidf.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
words_freq
Out[37]:
[('tour', 3.138098562201453),
 ('private', 2.0825423088575428),
 ('transfer', 0.9858659955017881),
 ('goa', 0.9237376001392338),
 ('barcelona', 0.916553640718507),
 ('london', 0.8927533459748753),
 ('paris', 0.8822607586615051),
 ('istanbul', 0.8681465446126231),
 ('airport', 0.8578475408738101),
 ('dubai', 0.8396210065107451),
 ('day', 0.8337294625779452),
 ('phuket', 0.8033367684025245),
 ('mallorca', 0.7490675159701988),
 ('bali', 0.7376600455046178),
 ('rome', 0.6629694357497168),
 ('palma', 0.5082958144083493),
 ('crete', 0.4738788186359383),
 ('palermo', 0.4686929630407816),
 ('city', 0.4658936476639842),
 ('chania', 0.4492617371483571),
 ('phi', 0.4166373923081416),
 ('guide', 0.40388686280511066),
 ('taormina', 0.38589119407720635),
 ('etna', 0.3824251055076506),
 ('heraklion', 0.37951333960021033),
 ('ubud', 0.37703781127932107),
 ('trip', 0.3233023886377564),
 ('catania', 0.32283382386081877),
 ('island', 0.3070577380372553),
 ('desert', 0.3020386322104823),
 ('colosseum', 0.28944923808256634),
 ('vatican', 0.2886198706382037),
 ('walk', 0.24365987791520896),
 ('de', 0.23933024676314693),
 ('experience', 0.2319534359719405),
 ('cruise', 0.21618352115813164),
 ('lunch', 0.21333061964046546),
 ('wine', 0.21188758401881197),
 ('local', 0.2048278137423077),
 ('safari', 0.20369673918616388),
 ('sitia', 0.20103949881524655),
 ('pmi', 0.19729903322429343),
 ('jsh', 0.1846281111568591),
 ('boat', 0.1817521982843795),
 ('cappadocia', 0.18152690817426195),
 ('group', 0.17828984844031226),
 ('hotel', 0.17599974009961786),
 ('dinner', 0.1733772077969627),
 ('taste', 0.17184544657238202),
 ('beach', 0.16892293673888853),
 ('food', 0.1657113173510083),
 ('museum', 0.16320814596959837),
 ('class', 0.16157583542428305),
 ('ticket', 0.15961308835129107),
 ('abu', 0.15632583304404266),
 ('dhabi', 0.15524523972807),
 ('temple', 0.15108841765694975),
 ('good', 0.14991260699179831),
 ('fullday', 0.14907451241799458),
 ('sistine', 0.14762740509655248),
 ('car', 0.14649481120930058),
 ('waterfall', 0.14266739539813184),
 ('small', 0.14212310867441014),
 ('excursion', 0.1397579935608121),
 ('bike', 0.13626248392244555),
 ('rethymno', 0.13539394818169667),
 ('dune', 0.13255278009264446),
 ('turkey', 0.13201956958128142),
 ('include', 0.12947210752955635),
 ('knossos', 0.12718825435250292),
 ('hour', 0.12666415091744057),
 ('batur', 0.1266293297061219),
 ('heathrow', 0.12656693961957438),
 ('sicily', 0.12609052489001765),
 ('luxury', 0.12530136833167274),
 ('creta', 0.1251368308952045),
 ('ephesus', 0.12494709263942706),
 ('messina', 0.12362382564748826),
 ('forum', 0.12150233059912886),
 ('sightseeing', 0.12081825578971062),
 ('half', 0.12028204794221654),
 ('night', 0.1198605068540515),
 ('civitavecchia', 0.11942891198822224),
 ('line', 0.1191687180759041),
 ('adventure', 0.11866336182246942),
 ('chapel', 0.11840169286263912),
 ('skip', 0.11340531581019436),
 ('bosphorus', 0.11198088491269408),
 ('chq', 0.11077686669411546),
 ('ride', 0.1105113947721325),
 ('nusa', 0.1103425991972959),
 ('palace', 0.10909412759248191),
 ('bay', 0.1066070272162302),
 ('atv', 0.10650415161126074),
 ('tanah', 0.10586374830736876),
 ('port', 0.10391251644646027),
 ('roman', 0.10337262748282978),
 ('sagrada', 0.10314912355352972),
 ('visit', 0.10108895816950264),
 ('cretan', 0.10051974940762327),
 ('sunset', 0.10031004474575289),
 ('familia', 0.10020200573771458),
 ('syracuse', 0.09820584280407947),
 ('water', 0.09809938595422275),
 ('swing', 0.0977203830529558),
 ('cdg', 0.09718676221536458),
 ('highlight', 0.09668623444288184),
 ('pamukkale', 0.09665718487200962),
 ('lot', 0.09575574498830418),
 ('old', 0.09513897938071034),
 ('cooking', 0.09508255139014962),
 ('burj', 0.09365142071762922),
 ('montserrat', 0.09357099065213052),
 ('james', 0.09275321738352545),
 ('arrival', 0.09262102690041997),
 ('agrigento', 0.09242902852148654),
 ('mount', 0.09222456436373663),
 ('versa', 0.09193471020794969),
 ('bond', 0.09161465991980078),
 ('raft', 0.0914946585743993),
 ('louvre', 0.09052463772011213),
 ('park', 0.0902286706257994),
 ('cta', 0.09011830280844939),
 ('castle', 0.08961986581289814),
 ('vice', 0.0892880904254558),
 ('quad', 0.08928309829030198),
 ('camel', 0.08836365880525354),
 ('walking', 0.08799472018866634),
 ('village', 0.08726308193527141),
 ('uluwatu', 0.08713400822221891),
 ('penida', 0.08672683995949826),
 ('departure', 0.08609987563389516),
 ('plantation', 0.08557034536354162),
 ('cesarinas', 0.08476639867960487),
 ('fiumicino', 0.08459547932499076),
 ('sicilian', 0.08434148852585648),
 ('tiramisu', 0.08189054566104409),
 ('speedboat', 0.08113455453055947),
 ('nga', 0.08108212724559642),
 ('north', 0.08066114308307175),
 ('home', 0.08064333483861291),
 ('windsor', 0.08004233335625256),
 ('phang', 0.07983470990335648),
 ('skiptheline', 0.07977883762436667),
 ('vip', 0.07880757050025558),
 ('pmo', 0.07856467424326356),
 ('khai', 0.07733987521887659),
 ('shore', 0.07657412667103831),
 ('cefalù', 0.0754863337459489),
 ('gorge', 0.0751395863661337),
 ('versaille', 0.07475904785797276),
 ('south', 0.07382887210508818),
 ('tower', 0.07356086888406942),
 ('montmartre', 0.07351306372700654),
 ('driver', 0.07326392285259319),
 ('kintamani', 0.0720687825015549),
 ('explore', 0.07204851665980237),
 ('noto', 0.07163249710415208),
 ('river', 0.07078409212684715),
 ('cala', 0.07022507962220614),
 ('customer', 0.07015231538566796),
 ('pasta', 0.07005931165573354),
 ('volcano', 0.06998314269752369),
 ('halfday', 0.06992041559812033),
 ('art', 0.06975107863852788),
 ('st', 0.06964905694304915),
 ('sunrise', 0.06959244427840715),
 ('sailing', 0.06944917812016062),
 ('monreale', 0.06932177139111492),
 ('stonehenge', 0.06853624793629126),
 ('dudhsagar', 0.06805384182361718),
 ('bbq', 0.06791800870437857),
 ('rice', 0.0677973632353311),
 ('morning', 0.06771566302617796),
 ('market', 0.06730870319352147),
 ('center', 0.06719598617682139),
 ('smallgroup', 0.06689384872300673),
 ('khalifa', 0.06663658781831311),
 ('spice', 0.06656519584913641),
 ('package', 0.06648647513261614),
 ('speed', 0.06624223924595612),
 ('eiffel', 0.06388773935086059),
 ('rental', 0.06382563840199988),
 ('family', 0.06302416151249232),
 ('heritage', 0.06288475584868775),
 ('discover', 0.06271979930868835),
 ('india', 0.06265113390420263),
 ('tapas', 0.0626020004541849),
 ('jungle', 0.06223544883355017),
 ('segway', 0.06206844346777614),
 ('cave', 0.060998525306394935),
 ('trek', 0.060611163169160404),
 ('palatine', 0.059602213550173665),
 ('samaria', 0.059491280261654594),
 ('snorkeling', 0.05913697620264416),
 ('ancient', 0.05847647403916104),
 ('canoe', 0.05844324014657467),
 ('access', 0.05739773107789663),
 ('terrace', 0.057348316825488396),
 ('premium', 0.05676393058285566),
 ('catamaran', 0.056529062263350034),
 ('street', 0.055801653547307234),
 ('coral', 0.05568516418976243),
 ('mosque', 0.05562354131056334),
 ('agia', 0.05538843334705773),
 ('town', 0.05491839335066291),
 ('hampi', 0.054443073458893744),
 ('sea', 0.05431132289147468),
 ('van', 0.05422884510474162),
 ('charter', 0.05356835821674179),
 ('exclusive', 0.05334157804062066),
 ('area', 0.05305583807708972),
 ('orly', 0.05295432556606403),
 ('east', 0.05263077743881874),
 ('service', 0.052468081038847064),
 ('villa', 0.052279251662202486),
 ('dhow', 0.05222867693867784),
 ('godfather', 0.05199132854333619),
 ('pickup', 0.05170536960980403),
 ('koh', 0.05114411103183775),
 ('seine', 0.05104695160838802),
 ('siracusa', 0.05083596568681761),
 ('charles', 0.05020375331104862),
 ('majorca', 0.05016077115871868),
 ('heaven', 0.05002661322870336),
 ('panaji', 0.0499061506706526),
 ('badami', 0.0499061506706526),
 ('evening', 0.049872590053577644),
 ('gaulle', 0.0498393652386485),
 ('valley', 0.049754145338676656),
 ('trapani', 0.04968060283029902),
 ('quarter', 0.04967018850816062),
 ('gate', 0.048739661196324884),
 ('diving', 0.04871450820978362),
 ('winery', 0.04871258953013535),
 ('balloon', 0.048689748931370416),
 ('red', 0.048659069294311116),
 ('electric', 0.048016348363087595),
 ('history', 0.0478688966508233),
 ('castelmola', 0.04736987711726186),
 ('thai', 0.047281654814117655),
 ('bus', 0.0472314089457581),
 ('balos', 0.04718273951786399),
 ('hill', 0.04682499889306569),
 ('pizza', 0.04675998968701412),
 ('ortigia', 0.04621451426074327),
 ('westminster', 0.046024341679845225),
 ('central', 0.04602109830388052),
 ('jeep', 0.045882930374400575),
 ('southampton', 0.045524077096368645),
 ('mumbai', 0.04536922788241145),
 ('agio', 0.04513131606056556),
 ('nikolaos', 0.04513131606056556),
 ('elafonisi', 0.04513131606056556),
 ('erice', 0.045059151404224694),
 ('di', 0.045030421168799975),
 ('photo', 0.04496484424693435),
 ('racha', 0.04490702432063802),
 ('scuba', 0.04473983651034698),
 ('trekking', 0.04463079173264452),
 ('cabaret', 0.04444957346472021),
 ('gaudi', 0.044206767237227024),
 ('amazing', 0.044131262512439685),
 ('pompeii', 0.04408459795594175),
 ('lempuyang', 0.04397417237383011),
 ('krabi', 0.043659606978398074),
 ('photographer', 0.043604911957418106),
 ('entrance', 0.04357972483757398),
 ('tasting', 0.04349845863260848),
 ('french', 0.04322463920243878),
 ('traditional', 0.04312749088040545),
 ('ciampino', 0.04312710710685803),
 ('santorini', 0.043079892603267125),
 ('olive', 0.042838507942500964),
 ('pick', 0.042352079612257953),
 ('ayung', 0.04234549932294751),
 ('secret', 0.042160115092809494),
 ('marina', 0.041850280971336515),
 ('basilica', 0.04172384711561687),
 ('alcantara', 0.04159306283466895),
 ('oxford', 0.04152196042855602),
 ('brava', 0.041469950165531844),
 ('kid', 0.041164909292155605),
 ('simon', 0.041164772293918184),
 ('disneyland', 0.04111747632188502),
 ('elounda', 0.041028469145968686),
 ('fishing', 0.041014668916903536),
 ('latin', 0.04088101975839714),
 ('business', 0.04086885127790192),
 ('panjim', 0.04083230509417031),
 ('al', 0.04081421908080346),
 ('dance', 0.04037619458885896),
 ('minivan', 0.040078512924272336),
 ('white', 0.03982496574285),
 ('kayak', 0.03965842837884587),
 ('sabiha', 0.039487996258686854),
 ('naxos', 0.03928233712163178),
 ('train', 0.03919079727840462),
 ('peters', 0.03898026988504475),
 ('gramvousa', 0.038977045688670255),
 ('sophia', 0.038898623180198995),
 ('balinese', 0.03874796563234792),
 ('fco', 0.03856558616286343),
 ('massage', 0.03824867811804962),
 ('bash', 0.03818096383103345),
 ('blue', 0.03811707512317787),
 ('snorkel', 0.038038703560186696),
 ('inclusive', 0.037869838798751825),
 ('hagia', 0.03771987702322327),
 ('culture', 0.03766777042461863),
 ('gatwick', 0.037519843760743396),
 ('professional', 0.03751062089960914),
 ('costa', 0.037509252828097674),
 ('spa', 0.03747454639684817),
 ('church', 0.03740464538399271),
 ('cook', 0.037222253376202614),
 ('coast', 0.037179360759407316),
 ('hike', 0.03704540766637057),
 ('west', 0.03694014797666641),
 ('champagne', 0.03688747297079222),
 ('drach', 0.0367845655163937),
 ('lagoon', 0.03657917327733942),
 ('garden', 0.036491897861613304),
 ('free', 0.03639551273010936),
 ('catacomb', 0.03632776261593659),
 ('live', 0.03630964750600018),
 ('divar', 0.03629538230592916),
 ('goan', 0.03629538230592916),
 ('hubli', 0.03629538230592916),
 ('giverny', 0.03613353979802016),
 ('gothic', 0.036065871309892424),
 ('sand', 0.035793190356862264),
 ('turkish', 0.03557801123052035),
 ('girona', 0.03536541378978162),
 ('bath', 0.03533154229084169),
 ('trail', 0.035326773497287446),
 ('la', 0.03530049944807213),
 ('ragusa', 0.03516066821992136),
 ('cycling', 0.035132044558901426),
 ('roundtrip', 0.03505622384048468),
 ('modica', 0.03488069957315188),
 ('hot', 0.03460725592318522),
 ('audio', 0.034570138647264284),
 ('yacht', 0.0344886050849099),
 ('royal', 0.034410432567374895),
 ('cultural', 0.034370591153695756),
 ('taxi', 0.034344473225900424),
 ('normandy', 0.03426456360157085),
 ('cathedral', 0.03411516801079608),
 ('platform', 0.03406945768545709),
 ('grand', 0.03406039266866741),
 ('yoga', 0.034056098128518064),
 ('escape', 0.03392033372166563),
 ('bcn', 0.03389185488187405),
 ('workshop', 0.03385788648222675),
 ('share', 0.033684368375287835),
 ('loire', 0.03364157153608774),
 ('tivoli', 0.033589381496687505),
 ('vehicle', 0.03346202831124101),
 ('pastilla', 0.03344051410581245),
 ('alcudia', 0.03344051410581245),
 ('harry', 0.033187143515708485),
 ('amalfi', 0.03317469777450618),
 ('florence', 0.0330444068292398),
 ('jet', 0.03278650133385952),
 ('islands', 0.032734500180503376),
 ('mountain', 0.032720803447477576),
 ('forest', 0.03263778273577361),
 ('oil', 0.03262133933118514),
 ('flight', 0.032610548601813685),
 ('combo', 0.032553721831980194),
 ('hanuman', 0.032432850898238566),
 ('raya', 0.032432850898238566),
 ('big', 0.03242329134666585),
 ('segesta', 0.03235015998252029),
 ('giardini', 0.03235015998252029),
 ('sport', 0.03206129241106248),
 ('ski', 0.03201540588592453),
 ('piazza', 0.03200484655459111),
 ('jatiluwih', 0.03175912449221063),
 ('baroque', 0.03164871135695556),
 ('admission', 0.031603060189895525),
 ('wildlife', 0.03156770542295556),
 ('plane', 0.03150954367175722),
 ('national', 0.03147364495490419),
 ('shopping', 0.031266441494389244),
 ('lake', 0.031242355141605324),
 ('gokcen', 0.031236773159856766),
 ('world', 0.03119544312866388),
 ('ebike', 0.031175512377756353),
 ('boarding', 0.031158165905174423),
 ('monkey', 0.031092333708787587),
 ('way', 0.030854973224726804),
 ('nature', 0.030773440452852616),
 ('option', 0.0306206728868477),
 ('archaeological', 0.030333348595859462),
 ('course', 0.03018232851905893),
 ('potter', 0.030179712267659302),
 ('kecak', 0.030130451441328036),
 ('soller', 0.030096462695231207),
 ('afternoon', 0.029967082281299892),
 ('san', 0.029947086263157154),
 ('maiton', 0.029938016213758682),
 ('rafting', 0.029704433410020173),
 ('buffet', 0.02957426930565492),
 ('sitge', 0.029471178158151347),
 ('summit', 0.029364732488997236),
 ('fun', 0.029261414675530676),
 ('ferrari', 0.029184880569244075),
 ('jewish', 0.029084369436987487),
 ('countryside', 0.029021104918073866),
 ('underground', 0.02898739827539399),
 ('capo', 0.028884071412964545),
 ('topkapi', 0.028879280845905315),
 ('sant', 0.028716195156984554),
 ('khao', 0.028690598871518733),
 ('photography', 0.02856771419268323),
 ('fountain', 0.02854131087086506),
 ('scooter', 0.02853900350633037),
 ('luton', 0.02851508125816498),
 ('optional', 0.02850405687736978),
 ('le', 0.028343606669344285),
 ('fast', 0.028215410158698776),
 ('sanctuary', 0.028185260431992338),
 ('drive', 0.028039082159793824),
 ('enjoy', 0.027911746436412015),
 ('english', 0.027869010418419406),
 ('transport', 0.027739829297729595),
 ('instagram', 0.027583463670841062),
 ('dolphin', 0.02757586750307247),
 ('viceversa', 0.027437696911307475),
 ('ory', 0.027411650881256678),
 ('zipline', 0.02735284131282196),
 ('floor', 0.02734087893314472),
 ('early', 0.027301378043777073),
 ('chauffeur', 0.027300346647991858),
 ('people', 0.02727561948697873),
 ('backwater', 0.027221536729446872),
 ('campal', 0.027221536729446872),
 ('extension', 0.027221536729446872),
 ('british', 0.027172281019610115),
 ('antalya', 0.027111161610441723),
 ('authentic', 0.027080230997124075),
 ('semiprivate', 0.027029628782543707),
 ('formentor', 0.02675241128464996),
 ('saint', 0.02674002076405535),
 ('spinalonga', 0.026668504944879646),
 ('rethymnon', 0.026668504944879646),
 ('expert', 0.026591558564762308),
 ('camp', 0.02655212904262562),
 ('abbey', 0.026457575237207733),
 ('virtual', 0.026334586134615288),
 ('arena', 0.026274488983351826),
 ('wifi', 0.02619664334154627),
 ('classic', 0.026046231452091395),
 ('pantheon', 0.025999529479694337),
 ('mt', 0.02598857505753268),
 ('spring', 0.02596484380266624),
 ('buggy', 0.02580078409072323),
 ('breakfast', 0.02574340987141861),
 ('view', 0.025686195237188302),
 ('daily', 0.02568507726195553),
 ('session', 0.02567060389975891),
 ('executive', 0.02564848067867078),
 ('entry', 0.025622382607345104),
 ('hr', 0.02559222973103913),
 ('armerina', 0.025417982843408803),
 ('troy', 0.02534304237497813),
 ('trastevere', 0.02529570705306096),
 ('historical', 0.025280722922046608),
 ('dxb', 0.025213844039361714),
 ('hong', 0.0249483468447989),
 ('del', 0.0249252738478327),
 ('new', 0.02490608104690042),
 ('dps', 0.024837264025959597),
 ('crater', 0.0246267055948928),
 ('paleochora', 0.024617081487581214),
 ('bamboo', 0.024486943973067728),
 ('join', 0.024451527540475),
 ('special', 0.02440672071202471),
 ('tofrom', 0.024353696055269304),
 ('parisian', 0.024296690553841146),
 ('orsay', 0.024296690553841146),
 ('ist', 0.024164296218002406),
 ('region', 0.02415730815631),
 ('hiking', 0.02413088923041322),
 ('dover', 0.024012700006875772),
 ('natural', 0.02400905780602498),
 ('lesson', 0.023998294172558006),
 ('el', 0.023725158698552806),
 ('hide', 0.023644369012952634),
 ('tuscany', 0.023636972164335652),
 ('kuta', 0.02361575923779765),
 ('gallipoli', 0.02357492313951454),
 ('medieval', 0.023529203129126024),
 ('cotswold', 0.02351243542339919),
 ('lhr', 0.02351243542339919),
 ('golden', 0.023508364361203578),
 ('max', 0.02345730534566172),
 ('minute', 0.02343373588133493),
 ('semi', 0.023431598135048595),
 ('playa', 0.023408359874068713),
 ('pollensa', 0.023408359874068713),
 ('romantic', 0.02334724196056604),
 ('house', 0.023294561323278805),
 ('historic', 0.023260221088809083),
 ('fly', 0.02317169580720217),
 ('rock', 0.02316384397034705),
 ('bursa', 0.02298555006102668),
 ('marsala', 0.02289766451779904),
 ('casa', 0.02286538495263162),
 ('christmas', 0.022736777618703848),
 ('fontainhas', 0.022684613941205724),
 ('blive', 0.022684613941205724),
 ('chandor', 0.022684613941205724),
 ('anshi', 0.022684613941205724),
 ('bijapur', 0.022684613941205724),
 ('elafonissi', 0.02256565803028278),
 ('chrissi', 0.02256565803028278),
 ('matala', 0.02256565803028278),
 ('game', 0.02255246963857006),
 ('round', 0.02245718930273968),
 ('yao', 0.02245351216031901),
 ('marais', 0.02242771435739183),
 ('travel', 0.02241168250440743),
 ('air', 0.0224059058801957),
 ('jimbaran', 0.022394254449635703),
 ('photoshoot', 0.02231250112559791),
 ('bazaar', 0.022074241210780096),
 ('bedugul', 0.021987086186915056),
 ('monastery', 0.021985125155153336),
 ('rent', 0.021980845927122822),
 ('ottoman', 0.021806803904050952),
 ('spot', 0.021754966152709467),
 ('pass', 0.021725075147765734),
 ('creek', 0.0216118663194529),
 ('telaga', 0.021579917924194402),
 ('trevi', 0.021563553553429014),
 ('orvieto', 0.021563553553429014),
 ('gallery', 0.021524989765783237),
 ('black', 0.02150572251345748),
 ('license', 0.021498500582066416),
 ('gourmet', 0.02147322867267327),
 ('architecture', 0.021439180069002917),
 ('shuttle', 0.02131055039240979),
 ('teacher', 0.021274748458748268),
 ('italian', 0.02124927551796026),
 ('lak', 0.021206094818079065),
 ('ngurah', 0.021172749661473755),
 ('denpasar', 0.021172749661473755),
 ('meal', 0.02112729110038304),
 ('cheese', 0.021042715075745604),
 ('greek', 0.020930408426972674),
 ('gaudí', 0.020929934479004667),
 ('western', 0.020928169403187043),
 ('station', 0.020867679735771),
 ('montalbano', 0.020796531417334476),
 ('dua', 0.020765581398753105),
 ('waja', 0.020765581398753105),
 ('tirta', 0.020765581398753105),
 ('maya', 0.02068206040153395),
 ('guell', 0.020629824710705943),
 ('plus', 0.02051851181067449),
 ('zeus', 0.020514234572984343),
 ('imbros', 0.020514234572984343),
 ('ammoudara', 0.020514234572984343),
 ('uncharted', 0.020514234572984343),
 ('tr', 0.020514234572984343),
 ('churchill', 0.02051084792253972),
 ('plateau', 0.020371888284401875),
 ('besakih', 0.020358413136032455),
 ('chef', 0.02035600518159244),
 ('hidden', 0.020194343576680575),
 ('show', 0.020159996919219065),
 ('surf', 0.020122451883731858),
 ('tramuntana', 0.02006430846348747),
 ('elm', 0.02006430846348747),
 ('open', 0.020000495865614882),
 ('land', 0.01998088387596459),
 ('yai', 0.01995867747583912),
 ('cepung', 0.019951244873311808),
 ('borghese', 0.019904818664703706),
 ('flamenco', 0.019893045256752163),
 ('express', 0.01983305696672992),
 ('sandboarde', 0.01981087745949849),
 ('zaye', 0.01981087745949849),
 ('castel', 0.019766333494922565),
 ('royalty', 0.019712236647783432),
 ('milazzo', 0.01964116856081589),
 ('rai', 0.019604380692448985),
 ('belly', 0.01954461391848378),
 ('discovery', 0.019483922107493626),
 ('cava', 0.019418293447087377),
 ('time', 0.019344129522923064),
 ('vineyard', 0.019333035608123196),
 ('musée', 0.019312754029976294),
 ('min', 0.019284186904590838),
 ('self', 0.019249635494090114),
 ('arab', 0.01924341885306312),
 ('treasure', 0.01916740852964921),
 ('güell', 0.019156265802798375),
 ('real', 0.019152648355388137),
 ('dining', 0.019103042479947107),
 ('beautiful', 0.01906790020790327),
 ('spanish', 0.019018874813346406),
 ('hunt', 0.019001478985758467),
 ('love', 0.019000690886551753),
 ('make', 0.018976161206243673),
 ('holiday', 0.018906935202763164),
 ('dday', 0.01890338636377391),
 ('selfdrive', 0.018782642588268075),
 ('layover', 0.018774870061444567),
 ('tukad', 0.01872974008514986),
 ('room', 0.01872558928669867),
 ('transportation', 0.018722679278871078),
 ('noi', 0.018711260133599177),
 ('similan', 0.018711260133599177),
 ('peter', 0.018660767498159724),
 ('deste', 0.018660767498159724),
 ('ostia', 0.018660767498159724),
 ('sight', 0.018599904567777416),
 ('santa', 0.01858442650050673),
 ('court', 0.018509789588633405),
 ('isola', 0.018485805704297312),
 ('mondello', 0.018485805704297312),
 ('lassithi', 0.01846281111568591),
 ('pelagia', 0.01846281111568591),
 ('malia', 0.01846281111568591),
 ('vise', 0.01846281111568591),
 ('dive', 0.01845430948189881),
 ('italy', 0.018445610595074487),
 ('andorra', 0.01841948634884459),
 ('fromto', 0.01832070765467877),
 ('fort', 0.018314851042293072),
 ('coffee', 0.01825207397359913),
 ('ghetto', 0.018246083775978396),
 ('triangle', 0.018187020366107164),
 ('square', 0.0181758342194032),
 ('mormugao', 0.01814769115296458),
 ('taj', 0.01814769115296458),
 ('mahal', 0.01814769115296458),
 ('delhi', 0.01814769115296458),
 ('agra', 0.01814769115296458),
 ('happen', 0.01814769115296458),
 ('pax', 0.018104142194384654),
 ('bird', 0.018092309249423986),
 ('club', 0.01805360874141225),
 ('paella', 0.018051346149936413),
 ('atlantis', 0.01802966970046264),
 ('sheikh', 0.018009888599544083),
 ('seminyak', 0.01791540355970856),
 ('site', 0.01785980951331849),
 ('allinclusive', 0.01780417124527856),
 ('sorrento', 0.017730593330911273),
 ('training', 0.0177297739504158),
 ('panoramic', 0.0177249147904827),
 ('learn', 0.01771306227598903),
 ('salt', 0.01762018289053501),
 ('district', 0.017611275715668142),
 ('wonder', 0.017545986437347807),
 ('centre', 0.017516624245185693),
 ('lo', 0.017500818468282022),
 ('magic', 0.017491381726585604),
 ('hkt', 0.01746384279135923),
 ('mont', 0.017443777833526974),
 ('dorsay', 0.017443777833526974),
 ('hopon', 0.017439194189880804),
 ('light', 0.017434890840410396),
 ('vacation', 0.017434234053073618),
 ('appian', 0.017416716331615744),
 ('modern', 0.017351710380898052),
 ('hopoff', 0.017270673413725818),
 ('reserve', 0.017204817313771853),
 ('cliff', 0.017199443031056156),
 ('watch', 0.017190499045382182),
 ('climb', 0.017147260756650493),
 ('paddle', 0.0171329936607975),
 ('track', 0.017129749857972484),
 ('accommodation', 0.017115681344602545),
 ('path', 0.01706889470616141),
 ('bicycle', 0.017008355729911925),
 ('sup', 0.016762141780420126),
 ('rover', 0.016741947090484953),
 ('coasteere', 0.016720257052906223),
 ('ratjada', 0.016720257052906223),
 ('vintage', 0.016702679656761237),
 ('point', 0.0166320389075146),
 ('hampton', 0.016508731254727092),
 ('unesco', 0.0164639188261267),
 ('kournas', 0.016411387658387476),
 ('rethimno', 0.016411387658387476),
 ('gouve', 0.016411387658387476),
 ('unique', 0.016373253441588367),
 ('cab', 0.016335067970030658),
 ('eye', 0.01631547302820369),
 ('activity', 0.016259141233176292),
 ('bella', 0.016231959870284567),
 ('buddha', 0.016216425449119283),
 ('patong', 0.016216425449119283),
 ('pier', 0.016212282246988344),
 ('montjuic', 0.016209147986983242),
 ('savoca', 0.016175079991260145),
 ('marzamemi', 0.016175079991260145),
 ('sciacca', 0.016175079991260145),
 ('ferry', 0.01615915410509366),
 ('beginner', 0.016129905802226424),
 ('chocolate', 0.016109988389541755),
 ('es', 0.01608263701729538),
 ('terminal', 0.016049023183108926),
 ('ultimate', 0.015965654822884093),
 ('byzantine', 0.015913073119172318),
 ('gem', 0.015891676425174188),
 ('lembongan', 0.015879562246105317),
 ('return', 0.015742724237188054),
 ('friendly', 0.015736360117334668),
 ('golf', 0.015730811235309353),
 ('vito', 0.015679968450108783),
 ('war', 0.015674474938955),
 ('distillery', 0.015665633011743462),
 ('fire', 0.015638362555112917),
 ('horseback', 0.015602820631799595),
 ('accessible', 0.015586036901843679),
 ('moulin', 0.015574801637077655),
 ('rouge', 0.015574801637077655),
 ('postcode', 0.01557218906284896),
 ('vespa', 0.015501308493360167),
 ('stadium', 0.015489650411158207),
 ('official', 0.015481942909508847),
 ('aquarium', 0.015419019693937133),
 ('drink', 0.015416103264018302),
 ('portuguese', 0.015378689708049067),
 ('heart', 0.015345561035386377),
 ('assisi', 0.015343297720709106),
 ('asian', 0.015323700040684451),
 ('southern', 0.015306240068714538),
 ('independent', 0.015225926270495358),
 ('frame', 0.015204552671009827),
 ('da', 0.01512711076285625),
 ('great', 0.015120153950915595),
 ('caltagirone', 0.015019717134741565),
 ('pauls', 0.015007937504297356),
 ('international', 0.015007720626694497),
 ('chalong', 0.014969008106879341),
 ('passenger', 0.01496255949035425),
 ('michel', 0.014951809571594551),
 ('customize', 0.014943419905635167),
 ('navona', 0.01492861399852778),
 ('pub', 0.014858015219226998),
 ('organic', 0.01475507262803898),
 ('anzac', 0.014734326962196587),
 ('canyon', 0.014705999679115114),
 ('vulcano', 0.014704315174487167),
 ('hop', 0.014700689164610023),
 ('dei', 0.014679783534705063),
 ('sailboat', 0.014651414553534445),
 ('music', 0.014640810101608778),
 ('shoot', 0.01452372898456263),
 ('roma', 0.014513930276346452),
 ('papal', 0.014513930276346452),
 ('naples', 0.014513930276346452),
 ('guard', 0.014507672920820778),
 ('pastry', 0.014498776450751347),
 ('kidfriendly', 0.014479114229654517),
 ('picasso', 0.014455637401636684),
 ('mediterranean', 0.014443363849582499),
 ('sail', 0.014404434618424885),
 ('christian', 0.014388838408842768),
 ('minoan', 0.014359964201089039),
 ('fodele', 0.014359964201089039),
 ('plakias', 0.014359964201089039),
 ('plaka', 0.014359964201089039),
 ('honeymoon', 0.014359610948204938),
 ('motorcycle', 0.014332294901814167),
 ('story', 0.014288105838256826),
 ('basis', 0.01428540775068169),
 ('horse', 0.014214215771240291),
 ('star', 0.014146584441392922),
 ('continent', 0.014144953883708726),
 ('dolmabahce', 0.014144953883708726),
 ('canal', 0.014139166570033994),
 ('selfguide', 0.014043990864652111),
 ('audioguide', 0.014018004849544866),
 ('banana', 0.01397908530091763),
 ('butterfly', 0.013877708411450099),
 ('stromboli', 0.013864354278222983),
 ('aeolian', 0.013864354278222983),
 ('danu', 0.01384372093250207),
 ('see', 0.013839077798923634),
 ('culinary', 0.013753857562744654),
 ('legend', 0.013747398698919513),
 ('coach', 0.013740872863852359),
 ('rum', 0.013721590764639396),
 ('kata', 0.013721590764639396),
 ('notre', 0.013705825440628339),
 ('wadi', 0.013687515335653502),
 ('sharjah', 0.013687515335653502),
 ('gladiator', 0.013684562831983799),
 ('location', 0.013676535681109235),
 ('vw', 0.013644006880137458),
 ('jack', 0.013627966129198843),
 ('vista', 0.013610768364723436),
 ('crocodile', 0.013610768364723436),
 ('margao', 0.013610768364723436),
 ('mollem', 0.013610768364723436),
 ('belgaum', 0.013610768364723436),
 ('chorao', 0.013610768364723436),
 ('chennai', 0.013610768364723436),
 ('hire', 0.013557037152026963),
 ('bodrum', 0.013555580805220862),
 ('like', 0.013534118794122228),
 ('price', 0.013508166080198843),
 ('offer', 0.01349245369538417),
 ('iconic', 0.013473497189326463),
 ('positano', 0.01345697095928471),
 ('fall', 0.013437253843065116),
 ('bellver', 0.01337620564232498),
 ('rafa', 0.01337620564232498),
 ('nadal', 0.01337620564232498),
 ('ibiza', 0.01337620564232498),
 ('cabrera', 0.01337620564232498),
 ('torrent', 0.01337620564232498),
 ('millor', 0.01337620564232498),
 ('magaluf', 0.01337620564232498),
 ('valldemossa', 0.01337620564232498),
 ('ponsa', 0.01337620564232498),
 ('drop', 0.013375846507514031),
 ('kayaking', 0.013371523748359952),
 ('pearl', 0.013363842868667025),
 ('luxe', 0.013285191079791429),
 ('siena', 0.01326987910980247),
 ('cia', 0.01326987910980247),
 ('john', 0.013263500507292832),
 ('nou', 0.013262030171168109),
 ('prat', 0.013262030171168109),
 ('monet', 0.013165020990740748),
 ('multi', 0.01312532452610784),
 ('marai', 0.013082833375145233),
 ('roissy', 0.013082833375145233),
 ('dame', 0.013082833375145233),
 ('route', 0.013073849780263283),
 ('place', 0.013035503572439674),
 ('miracle', 0.013032704092257246),
 ('ulun', 0.013029384407060772),
 ('thame', 0.013006879170391042),
 ('greenwich', 0.013006879170391042),
 ('soho', 0.013006879170391042),
 ('mosaic', 0.01299732980578283),
 ('paradise', 0.012951791065340853),
 ('factory', 0.012950698548042388),
 ('priority', 0.01294079890969059),
 ('helicopter', 0.012898099273347454),
 ('attraction', 0.012885866485949507),
 ('sedan', 0.012784294654896671),
 ('mini', 0.012776431251459405),
 ('favignana', 0.012708991421704402),
 ('pm', 0.012676918370656072),
 ('sekumpul', 0.012622216144340122),
 ('sanur', 0.012622216144340122),
 ('offroad', 0.012615434969756612),
 ('road', 0.012612913609362546),
 ('barcelonas', 0.012525250717214323),
 ('tarragona', 0.012525250717214323),
 ('boqueria', 0.012525250717214323),
 ('landmark', 0.012518597376843032),
 ('kensington', 0.012506614586914462),
 ('cambridge', 0.012506614586914462),
 ('lcy', 0.012506614586914462),
 ('june', 0.012474751331723311),
 ('grays', 0.01247417342239945),
 ('maithon', 0.01247417342239945),
 ('bahtra', 0.01247417342239945),
 ('fontainebleau', 0.012459841309662125),
 ('spiritual', 0.012425432181484038),
 ('sapanca', 0.012376834648245134),
 ('sfakia', 0.012308540743790607),
 ('seabybus', 0.012308540743790607),
 ('palekastro', 0.012308540743790607),
 ('sissi', 0.012308540743790607),
 ('kalo', 0.012308540743790607),
 ('chorio', 0.012308540743790607),
 ('eloúnda', 0.012308540743790607),
 ('ruin', 0.01230610651040318),
 ('long', 0.01227764948412368),
 ('custom', 0.012249062203889365),
 ('tegalalang', 0.012215047881619475),
 ('canggu', 0.012215047881619475),
 ('sic', 0.012166207690899298),
 ('mercede', 0.012153675347718699),
 ('overnight', 0.012135177992892435),
 ('france', 0.012103125594661061),
 ('high', 0.012075501242374297),
 ('daytrip', 0.012065331874068708),
 ('demo', 0.012059201606218919),
 ('mystery', 0.012046902938730723),
 ('rediscover', 0.012045221173918172),
 ('entertainment', 0.012004878364804562),
 ('commercial', 0.011998428006977922),
 ('beauty', 0.01198648640559483),
 ('exploration', 0.011938452361506785),
 ('chantilly', 0.011836849244179019),
 ('greece', 0.011826897404699225),
 ('gelato', 0.011820439389438777),
 ('empul', 0.011807879618898825),
 ('near', 0.01179772549832075),
 ('crawl', 0.011796472043971098),
 ('figuere', 0.01178847126326054),
 ('horn', 0.01178746156975727),
 ('riding', 0.011733827922077084),
 ('spain', 0.011730476746687209),
 ('personalize', 0.01172370655162294),
 ('tree', 0.011718551490620059),
 ('whitewater', 0.011717499533321006),
 ('archaeologist', 0.011710211158179433),
 ('body', 0.011708197431080506),
 ('instagramable', 0.011589340334595315),
 ('cellar', 0.011582427098899317),
 ('dia', 0.01156295681060128),
 ('withlocal', 0.011557159011252599),
 ('casale', 0.011553628565185818),
 ('lipari', 0.011553628565185818),
 ('madonie', 0.011553628565185818),
 ('turchi', 0.011553628565185818),
 ('selinunte', 0.011553628565185818),
 ('online', 0.011533110469299761),
 ('tea', 0.011525701267122159),
 ('beatle', 0.011506085419961306),
 ('tate', 0.011506085419961306),
 ('shakespeare', 0.011506085419961306),
 ('europe', 0.011485480957296521),
 ('relic', 0.01148521064681669),
 ('hrs', 0.011467149004892154),
 ('outlet', 0.011402457919368452),
 ('tegenungan', 0.011400711356178175),
 ('gangga', 0.011400711356178175),
 ('aperitivo', 0.011295462216776587),
 ('destination', 0.01129450488621669),
 ('ghost', 0.011277303406638226),
 ('cocktail', 0.011264735503000873),
 ('tail', 0.011226756080159505),
 ('starlight', 0.011226756080159505),
 ('try', 0.011209941419126792),
 ('stanbul', 0.011198088491269407),
 ('kusadasi', 0.011198088491269407),
 ('cart', 0.011196460498895834),
 ('frascati', 0.011196460498895834),
 ('wild', 0.011158722572539935),
 ('picnic', 0.011120878510568059),
 ('palm', 0.011111510244773628),
 ('farmhouse', 0.011070636680554592),
 ('person', 0.011067676977329681),
 ('warner', 0.011067396954274344),
 ('dali', 0.011051691809306756),
 ('madrid', 0.011051691809306756),
 ('ripper', 0.011005820836484728),
 ('buckingham', 0.011005820836484728),
 ('cruiser', 0.011003058871761048),
 ('jetski', 0.011002634914606183),
 ('tibumana', 0.010993543093457528),
 ('taman', 0.010993543093457528),
 ('movie', 0.010989444272929982),
 ('venice', 0.010978047280266615),
 ('prince', 0.010953889278735038),
 ('beer', 0.010952011759077665),
 ('personal', 0.010936067063240667),
 ('idyllic', 0.010926852966196263),
 ('start', 0.010875113137143113),
 ('treatment', 0.010865427287371042),
 ('holy', 0.01079802426574954),
 ('padi', 0.010761472780501675),
 ('life', 0.010739721591496203),
 ('journey', 0.010729606002446518),
 ('minibus', 0.010692150893081103),
 ('wonderful', 0.0106697819783578),
 ('bros', 0.010637763918838746),
 ('istanbuls', 0.010608715412781545),
 ('immersive', 0.01060373247381574),
 ('easy', 0.010603340871580194),
 ('diver', 0.010597010503850118),
 ('reim', 0.010590865113212807),
 ('cabbelair', 0.010590865113212807),
 ('magical', 0.010582430611515906),
 ('change', 0.01058092148251886),
 ('dalí', 0.010561386949092635),
 ('global', 0.010557967846694867),
 ('lgw', 0.01050555625300815),
 ('stanste', 0.01050555625300815),
 ('emirate', 0.010500400005951024),
 ('escooter', 0.010463207200873275),
 ('restaurant', 0.010459422453866149),
 ('king', 0.010438223650278052),
 ('cefalu', 0.010398265708667238),
 ('saline', 0.010398265708667238),
 ('comiso', 0.010398265708667238),
 ('genealogy', 0.010398265708667238),
 ('ancestry', 0.010398265708667238),
 ('record', 0.010398265708667238),
 ('research', 0.010398265708667238),
 ('incl', 0.010397785275503281),
 ('cesarina', 0.010379389725602102),
 ('pisa', 0.01036709305453318),
 ...]

In contrast, there is not as much overlap with these words as in the count vectorization because tf-idf vectorization is finding more words that are unique to the cities. This tells us that tf-idf vectorization is probably a better vectorization technique to use while modeling in order to best predict the cities.

Word Clouds with Bi-Grams¶

In [38]:
cv2 = CountVectorizer(analyzer='word', stop_words=stopwords_list, ngram_range=(2,2))
data3 = cv2.fit_transform(df_grouped['lemmatized'])
df_dtm3 = pd.DataFrame(data3.toarray(), columns=cv2.get_feature_names_out())
df_dtm3.index = df_grouped.index
df_dtm3
Out[38]:
aal deep abandon ghost abandon hotel abandon village abant yedigoller abba super abbate arrival abbey avebury abbey banquet abbey buckingham ... روما إلى فورميا العكس كاستيلوا العكس مدينة بوجيوا مدينة تشيتا مدينة روما مدينة فورميا من مدينة ميرتيتوا العكس نقل خصوصي
City
Bali, Indonesia 0 1 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Barcelona, Spain 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Crete, Greece 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Dubai, United Arab Emirates 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Goa, India 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Istanbul, Turkey 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
London, United Kingdom 0 0 0 0 0 1 0 1 1 1 ... 0 0 0 0 0 0 0 0 0 0
Majorca, Balearic Islands 0 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Paris, France 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Phuket, Thailand 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Rome, Italy 0 0 0 0 0 0 0 0 0 0 ... 3 1 1 1 1 3 1 3 1 3
Sicily, Italy 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

12 rows × 57339 columns

In [39]:
# Transposing document term matrix
df_dtm3 = df_dtm3.transpose()

# Plotting word cloud for each city
for index, city in enumerate(df_dtm3.columns):
    generate_wordcloud(df_dtm3[city].sort_values(ascending=False), city)
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [40]:
# Look at top bi-grams (in total, not per city)
sum_words = data3.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in cv2.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
words_freq
Out[40]:
[('private tour', 2228),
 ('private transfer', 1569),
 ('day tour', 824),
 ('city tour', 818),
 ('tour private', 813),
 ('desert safari', 765),
 ('guide tour', 625),
 ('walk tour', 618),
 ('day trip', 589),
 ('small group', 573),
 ('tour rome', 562),
 ('airport transfer', 549),
 ('skip line', 513),
 ('half day', 465),
 ('abu dhabi', 426),
 ('tour london', 424),
 ('tour istanbul', 392),
 ('private day', 386),
 ('tour bali', 350),
 ('sistine chapel', 350),
 ('dubai city', 346),
 ('tour dubai', 344),
 ('vatican museum', 311),
 ('walking tour', 307),
 ('tour day', 304),
 ('food tour', 295),
 ('tour paris', 294),
 ('quad bike', 292),
 ('airport private', 291),
 ('day private', 290),
 ('private guide', 288),
 ('bbq dinner', 285),
 ('group tour', 268),
 ('tanah lot', 259),
 ('cooking class', 256),
 ('camel ride', 252),
 ('rome private', 250),
 ('transfer rome', 249),
 ('vice versa', 236),
 ('shore excursion', 230),
 ('roman forum', 230),
 ('london private', 223),
 ('wine taste', 207),
 ('sightseeing tour', 206),
 ('tour barcelona', 206),
 ('nusa penida', 201),
 ('rome city', 198),
 ('mount batur', 192),
 ('paris private', 190),
 ('arrival transfer', 188),
 ('heathrow airport', 185),
 ('museum sistine', 185),
 ('bike tour', 183),
 ('burj khalifa', 182),
 ('transfer private', 179),
 ('safari dubai', 178),
 ('rice terrace', 177),
 ('transfer paris', 177),
 ('transfer dubai', 174),
 ('red dune', 173),
 ('city center', 172),
 ('fullday tour', 168),
 ('departure transfer', 167),
 ('phi phi', 167),
 ('private walking', 164),
 ('cesarinas home', 163),
 ('bali tour', 160),
 ('dhabi city', 160),
 ('bali private', 159),
 ('transfer london', 157),
 ('private car', 155),
 ('street food', 155),
 ('dubai desert', 153),
 ('tour include', 151),
 ('barcelona private', 151),
 ('private airport', 149),
 ('tiramisu class', 148),
 ('private arrival', 146),
 ('dinner cruise', 145),
 ('local guide', 144),
 ('ancient rome', 143),
 ('safari bbq', 140),
 ('colosseum roman', 140),
 ('batur sunrise', 138),
 ('palatine hill', 138),
 ('dubai private', 137),
 ('tour lunch', 136),
 ('tour skip', 136),
 ('sagrada familia', 136),
 ('central london', 136),
 ('class cesarinas', 136),
 ('tour local', 135),
 ('tour guide', 134),
 ('white water', 133),
 ('ubud tour', 133),
 ('tour colosseum', 133),
 ('tour vatican', 132),
 ('fiumicino airport', 131),
 ('paris city', 130),
 ('home taste', 130),
 ('civitavecchia port', 129),
 ('sunrise trek', 126),
 ('private departure', 125),
 ('tour ubud', 125),
 ('london city', 125),
 ('transfer palermo', 124),
 ('cruise port', 121),
 ('tour good', 120),
 ('istanbul airport', 119),
 ('transfer barcelona', 114),
 ('louvre museum', 114),
 ('eiffel tower', 114),
 ('day istanbul', 113),
 ('rome day', 113),
 ('gate heaven', 112),
 ('morning desert', 112),
 ('forum palatine', 112),
 ('transfer airport', 111),
 ('istanbul day', 109),
 ('palermo airport', 109),
 ('sand boarding', 108),
 ('water raft', 107),
 ('car charter', 107),
 ('dhow cruise', 107),
 ('arrival private', 106),
 ('evening desert', 105),
 ('hotel pickup', 104),
 ('barcelona airport', 104),
 ('versa private', 104),
 ('london london', 103),
 ('james bond', 103),
 ('rome rome', 103),
 ('bali airport', 102),
 ('market tour', 102),
 ('dubai airport', 102),
 ('fullday private', 101),
 ('luxury van', 101),
 ('dune bash', 101),
 ('wine tour', 100),
 ('dubai dubai', 100),
 ('paris airport', 100),
 ('private dubai', 99),
 ('turkey tour', 99),
 ('transfer catania', 99),
 ('bali day', 98),
 ('speed boat', 98),
 ('temple tour', 97),
 ('departure private', 97),
 ('old city', 97),
 ('windsor castle', 97),
 ('paris paris', 97),
 ('entrance ticket', 96),
 ('cruise dinner', 96),
 ('pasta tiramisu', 96),
 ('dubai marina', 95),
 ('sunset tour', 93),
 ('round trip', 93),
 ('lot temple', 93),
 ('guide private', 93),
 ('tour abu', 93),
 ('admission ticket', 92),
 ('private fullday', 92),
 ('segway tour', 92),
 ('business car', 92),
 ('trip rome', 92),
 ('tour package', 91),
 ('london heathrow', 91),
 ('rome tour', 91),
 ('sitia jsh', 90),
 ('jsh airport', 90),
 ('dinner dubai', 90),
 ('river cruise', 90),
 ('hot spring', 89),
 ('island tour', 89),
 ('istanbul cappadocia', 89),
 ('monkey forest', 88),
 ('uluwatu temple', 88),
 ('tour hour', 88),
 ('day night', 88),
 ('night tour', 88),
 ('kid family', 88),
 ('st peters', 88),
 ('private driver', 86),
 ('istanbul private', 86),
 ('charles de', 84),
 ('jungle swing', 83),
 ('food wine', 83),
 ('heraklion airport', 83),
 ('tour phuket', 83),
 ('dune desert', 82),
 ('rome night', 82),
 ('airport rome', 82),
 ('rome colosseum', 81),
 ('blue lagoon', 80),
 ('instagram tour', 80),
 ('shopping tour', 80),
 ('barcelona city', 80),
 ('safari camel', 80),
 ('de gaulle', 80),
 ('airport cdg', 80),
 ('tour fullday', 79),
 ('hotel pick', 79),
 ('ephesus pamukkale', 79),
 ('free wifi', 78),
 ('istanbul tour', 78),
 ('chapel st', 78),
 ('atv ride', 77),
 ('semi private', 77),
 ('tour hotel', 77),
 ('tour kid', 77),
 ('dubai tour', 77),
 ('transfer fiumicino', 77),
 ('ayung river', 76),
 ('seine river', 76),
 ('de mallorca', 76),
 ('good rome', 76),
 ('bosphorus cruise', 75),
 ('cappadocia tour', 75),
 ('transfer heathrow', 75),
 ('ciampino airport', 75),
 ('halfday tour', 74),
 ('luxury car', 74),
 ('hour private', 74),
 ('boat tour', 74),
 ('ferrari world', 74),
 ('day cappadocia', 74),
 ('bali swing', 73),
 ('tour explore', 73),
 ('safari quad', 73),
 ('city airport', 72),
 ('palma de', 72),
 ('cook class', 71),
 ('crete private', 71),
 ('share transfer', 71),
 ('istanbul city', 71),
 ('trip paris', 71),
 ('smallgroup tour', 70),
 ('harry potter', 70),
 ('rome airport', 69),
 ('transfer hotel', 68),
 ('airport hotel', 68),
 ('waterfall tour', 68),
 ('line ticket', 68),
 ('private vehicle', 68),
 ('balloon ride', 68),
 ('old town', 68),
 ('private pizza', 68),
 ('catania cta', 68),
 ('cta airport', 68),
 ('hotel transfer', 67),
 ('local photographer', 67),
 ('private pasta', 67),
 ('pizza tiramisu', 67),
 ('east bali', 66),
 ('water park', 66),
 ('atv quad', 66),
 ('day dubai', 66),
 ('smallgroup street', 66),
 ('audio tour', 65),
 ('tour burj', 65),
 ('tower london', 65),
 ('phi island', 65),
 ('peters basilica', 65),
 ('jet ski', 64),
 ('highlight tour', 64),
 ('tour half', 64),
 ('lot sunset', 64),
 ('transfer istanbul', 64),
 ('gatwick airport', 64),
 ('rome vatican', 64),
 ('amalfi coast', 64),
 ('good ubud', 63),
 ('kintamani volcano', 63),
 ('local private', 63),
 ('istanbul istanbul', 63),
 ('mallorca airport', 63),
 ('phang nga', 63),
 ('transfer bali', 62),
 ('barcelona barcelona', 62),
 ('costa brava', 62),
 ('taste tour', 62),
 ('tour skiptheline', 62),
 ('roundtrip transfer', 61),
 ('creta private', 61),
 ('hagia sophia', 61),
 ('etna taormina', 61),
 ('private bali', 60),
 ('ticket private', 60),
 ('private shore', 60),
 ('fast track', 60),
 ('burj al', 60),
 ('colosseum tour', 60),
 ('hour tour', 59),
 ('private photo', 59),
 ('session local', 59),
 ('private london', 59),
 ('bali car', 58),
 ('airport pmi', 58),
 ('orly airport', 58),
 ('lempuyang temple', 57),
 ('experience private', 57),
 ('al arab', 57),
 ('airport dxb', 57),
 ('airport london', 57),
 ('airport fco', 57),
 ('hotel private', 56),
 ('transfer tofrom', 56),
 ('grand mosque', 56),
 ('bond island', 56),
 ('line colosseum', 56),
 ('bali bali', 55),
 ('transfer service', 55),
 ('scuba diving', 55),
 ('private half', 55),
 ('tour small', 55),
 ('hot air', 55),
 ('air balloon', 55),
 ('museum private', 55),
 ('ride bbq', 55),
 ('trip london', 55),
 ('vatican sistine', 55),
 ('vatican city', 55),
 ('batur volcano', 54),
 ('hopon hopoff', 54),
 ('skiptheline ticket', 54),
 ('gaulle airport', 54),
 ('phuket city', 54),
 ('include lunch', 53),
 ('dinner private', 53),
 ('tour visit', 53),
 ('transfer heraklion', 53),
 ('dubai abu', 53),
 ('nga bay', 53),
 ('catania airport', 53),
 ('ngurah rai', 52),
 ('tour transfer', 52),
 ('chania chq', 52),
 ('chq airport', 52),
 ('belly dance', 52),
 ('dune buggy', 52),
 ('sabiha gokcen', 52),
 ('luton airport', 52),
 ('colosseum ancient', 52),
 ('tour palermo', 52),
 ('palermo pmo', 52),
 ('pmo airport', 52),
 ('bali atv', 51),
 ('raft adventure', 51),
 ('telaga waja', 51),
 ('island day', 51),
 ('volcano tour', 51),
 ('private trip', 51),
 ('tour night', 51),
 ('live show', 51),
 ('lunch private', 50),
 ('penida island', 50),
 ('package day', 50),
 ('cruise dubai', 50),
 ('gokcen airport', 50),
 ('cappadocia ephesus', 50),
 ('london airport', 50),
 ('cdg paris', 50),
 ('rome walk', 50),
 ('nusa dua', 49),
 ('virtual tour', 49),
 ('bali ubud', 49),
 ('car private', 49),
 ('private luxury', 49),
 ('private cooking', 49),
 ('sheikh zaye', 49),
 ('vatican tour', 49),
 ('day bali', 48),
 ('driver private', 48),
 ('gothic quarter', 48),
 ('tour city', 48),
 ('street art', 48),
 ('transfer sitia', 48),
 ('istanbul old', 48),
 ('airport paris', 48),
 ('local home', 48),
 ('bali good', 47),
 ('port private', 47),
 ('private city', 47),
 ('tour vip', 47),
 ('dubai day', 47),
 ('dubai morning', 47),
 ('tour desert', 47),
 ('istanbul bosphorus', 47),
 ('phuket airport', 47),
 ('transfer civitavecchia', 47),
 ('tour catania', 47),
 ('tour taormina', 47),
 ('good bali', 46),
 ('airport arrival', 46),
 ('cycling tour', 46),
 ('car tour', 46),
 ('heritage tour', 46),
 ('photo session', 46),
 ('tour tour', 46),
 ('night private', 46),
 ('tour professional', 46),
 ('blue mosque', 46),
 ('tour louvre', 46),
 ('topkapi palace', 46),
 ('london tour', 46),
 ('rome civitavecchia', 46),
 ('tour etna', 46),
 ('uluwatu sunset', 45),
 ('river raft', 45),
 ('cruise terminal', 45),
 ('tour ferrari', 45),
 ('transfer chania', 45),
 ('dhabi tour', 45),
 ('private istanbul', 45),
 ('villa deste', 45),
 ('temple bali', 44),
 ('ubud bali', 44),
 ('ubud private', 44),
 ('international airport', 44),
 ('museum tour', 44),
 ('experience barcelona', 44),
 ('ticket transfer', 44),
 ('dinner live', 44),
 ('dubai frame', 44),
 ('phuket phuket', 44),
 ('port rome', 44),
 ('palermo private', 44),
 ('tukad cepung', 43),
 ('natural hot', 43),
 ('halfday private', 43),
 ('jatiluwih rice', 43),
 ('adventure bali', 43),
 ('tour secret', 43),
 ('photo shoot', 43),
 ('ebike tour', 43),
 ('city centre', 43),
 ('airport central', 43),
 ('trevi fountain', 43),
 ('north bali', 42),
 ('adventure tour', 42),
 ('tour uluwatu', 42),
 ('waterfall bali', 42),
 ('water sport', 42),
 ('fullday bali', 42),
 ('city private', 42),
 ('southampton cruise', 42),
 ('transfer phuket', 42),
 ('rome hotel', 42),
 ('bali instagram', 41),
 ('mt batur', 41),
 ('ubud day', 41),
 ('tour tanah', 41),
 ('photo tour', 41),
 ('city sightseeing', 41),
 ('tour wine', 41),
 ('transfer central', 41),
 ('dubai evening', 41),
 ('loire valley', 41),
 ('private rome', 41),
 ('bali quad', 40),
 ('tour nusa', 40),
 ('photography tour', 40),
 ('highlight private', 40),
 ('trip bali', 40),
 ('lunch include', 40),
 ('lunch dinner', 40),
 ('taxi tour', 40),
 ('bus tour', 40),
 ('boat trip', 40),
 ('dining experience', 40),
 ('latin quarter', 40),
 ('british museum', 40),
 ('day rome', 40),
 ('package bali', 39),
 ('tour mount', 39),
 ('make class', 39),
 ('private group', 39),
 ('taste private', 39),
 ('tour ancient', 39),
 ('dune safari', 39),
 ('westminster abbey', 39),
 ('paris orly', 39),
 ('st peter', 39),
 ('colosseum forum', 39),
 ('private vatican', 39),
 ('colosseum vatican', 39),
 ('mount etna', 39),
 ('trip transfer', 38),
 ('airport dps', 38),
 ('buffet dinner', 38),
 ('tour discover', 38),
 ('line private', 38),
 ('private minivan', 38),
 ('ticket dubai', 38),
 ('ride dubai', 38),
 ('cappadocia pamukkale', 38),
 ('london day', 38),
 ('coral island', 38),
 ('valley temple', 38),
 ('water temple', 37),
 ('expert guide', 37),
 ('family tour', 37),
 ('miracle garden', 37),
 ('khalifa ticket', 37),
 ('ride option', 37),
 ('london walk', 37),
 ('tofrom london', 37),
 ('paris charles', 37),
 ('speedboat phuket', 37),
 ('chapel tour', 37),
 ('civitavecchia cruise', 37),
 ('port civitavecchia', 37),
 ('kecak dance', 36),
 ('beach tour', 36),
 ('english speaking', 36),
 ('early morning', 36),
 ('atv bike', 36),
 ('local expert', 36),
 ('trip barcelona', 36),
 ('horseback ride', 36),
 ('city luxury', 36),
 ('entry ticket', 36),
 ('airport creta', 36),
 ('ride sand', 36),
 ('layover tour', 36),
 ('day turkey', 36),
 ('tour turkey', 36),
 ('jewish ghetto', 36),
 ('rome fiumicino', 36),
 ('rome highlight', 36),
 ('transfer ciampino', 36),
 ('rome florence', 36),
 ('taormina private', 36),
 ('bike adventure', 35),
 ('lunch bali', 35),
 ('temple private', 35),
 ('palace private', 35),
 ('electric bike', 35),
 ('kintamani tour', 35),
 ('tour royal', 35),
 ('private sightseeing', 35),
 ('tour luxury', 35),
 ('luxury private', 35),
 ('group day', 35),
 ('private barcelona', 35),
 ('city highlight', 35),
 ('treasure hunt', 35),
 ('city heraklion', 35),
 ('dhabi dubai', 35),
 ('guide walk', 35),
 ('pamukkale tour', 35),
 ('london postcode', 35),
 ('khai island', 35),
 ('skiptheline colosseum', 35),
 ('line vatican', 35),
 ('penida tour', 34),
 ('temple sunset', 34),
 ('cultural tour', 34),
 ('night day', 34),
 ('private vacation', 34),
 ('luxury vehicle', 34),
 ('private walk', 34),
 ('safari tour', 34),
 ('dubai red', 34),
 ('private abu', 34),
 ('london gatwick', 34),
 ('castle private', 34),
 ('hotel rome', 34),
 ('excursion civitavecchia', 34),
 ('bali fullday', 33),
 ('tour highlight', 33),
 ('ubud village', 33),
 ('village tour', 33),
 ('ubud kintamani', 33),
 ('speaking driver', 33),
 ('pick drop', 33),
 ('class lunch', 33),
 ('tour exclusive', 33),
 ('cruise private', 33),
 ('boarding camel', 33),
 ('wild wadi', 33),
 ('safari dune', 33),
 ('trip istanbul', 33),
 ('pamukkale ephesus', 33),
 ('london southampton', 33),
 ('airport lhr', 33),
 ('hampton court', 33),
 ('paris tour', 33),
 ('group market', 33),
 ('taormina castelmola', 33),
 ('jeep tour', 32),
 ('fire dance', 32),
 ('evening tour', 32),
 ('water rafting', 32),
 ('ulun danu', 32),
 ('tour amazing', 32),
 ('audio guide', 32),
 ('vip private', 32),
 ('dubai creek', 32),
 ('global village', 32),
 ('dubai hotel', 32),
 ('black cab', 32),
 ('tour speed', 32),
 ('appian way', 32),
 ('rome center', 32),
 ('tour experience', 31),
 ('buffet lunch', 31),
 ('visit private', 31),
 ('tour free', 31),
 ('guide day', 31),
 ('way private', 31),
 ('photography session', 31),
 ('food taste', 31),
 ('tour roman', 31),
 ('museum guide', 31),
 ('olive oil', 31),
 ('knossos palace', 31),
 ('tour chania', 31),
 ('dinner belly', 31),
 ('dubai night', 31),
 ('dhow dinner', 31),
 ('istanbul ephesus', 31),
 ('airport ist', 31),
 ('city london', 31),
 ('london windsor', 31),
 ('london luton', 31),
 ('hill roman', 31),
 ('arena floor', 31),
 ('tour blue', 30),
 ('afternoon tour', 30),
 ('ubud waterfall', 30),
 ('day car', 30),
 ('tour sunset', 30),
 ('trip private', 30),
 ('private ubud', 30),
 ('tour east', 30),
 ('private food', 30),
 ('morning tour', 30),
 ('minute private', 30),
 ('private chauffeur', 30),
 ('city dubai', 30),
 ('ephesus tour', 30),
 ('tour semiprivate', 30),
 ('fast access', 30),
 ('disneyland paris', 30),
 ('colosseum underground', 30),
 ('colosseum arena', 30),
 ('tour sistine', 30),
 ('lagoon snorkeling', 29),
 ('self drive', 29),
 ('coffee plantation', 29),
 ('tour driver', 29),
 ('private round', 29),
 ('explore tour', 29),
 ('include private', 29),
 ('like local', 29),
 ('safari private', 29),
 ('vacation photography', 29),
 ('tour sagrada', 29),
 ('airport city', 29),
 ('transfer luxury', 29),
 ('private personalize', 29),
 ('modern dubai', 29),
 ('safe private', 29),
 ('day abu', 29),
 ('tour bosphorus', 29),
 ('stonehenge bath', 29),
 ('private paris', 29),
 ('paris night', 29),
 ('simon cabaret', 29),
 ('rome skip', 29),
 ('chapel vatican', 29),
 ('borghese gallery', 29),
 ('fullday rome', 29),
 ('open water', 28),
 ('kecak fire', 28),
 ('river rafting', 28),
 ('ticket include', 28),
 ('private evening', 28),
 ('arrival departure', 28),
 ('bike ride', 28),
 ('tour smallgroup', 28),
 ('ticket day', 28),
 ('tour speedboat', 28),
 ('ride private', 28),
 ('private sailing', 28),
 ('combo tour', 28),
 ('tour cooking', 28),
 ('wine tasting', 28),
 ('tour old', 28),
 ('vehicle private', 28),
 ('roundtrip private', 28),
 ('experience dubai', 28),
 ('zaye grand', 28),
 ('private basis', 28),
 ('new airport', 28),
 ('grand bazaar', 28),
 ('st pauls', 28),
 ('court palace', 28),
 ('paris day', 28),
 ('taste rome', 28),
 ('giardini naxos', 28),
 ('raft bali', 27),
 ('private halfday', 27),
 ('sunset dinner', 27),
 ('trip lunch', 27),
 ('rice field', 27),
 ('lot tour', 27),
 ('bali mount', 27),
 ('tour pickup', 27),
 ('tour walk', 27),
 ('hotel airport', 27),
 ('culture tour', 27),
 ('market visit', 27),
 ('tour optional', 27),
 ('de la', 27),
 ('day guide', 27),
 ('transfer city', 27),
 ('barcelona tour', 27),
 ('tour option', 27),
 ('area customer', 27),
 ('khalifa floor', 27),
 ('dinner camel', 27),
 ('zaye mosque', 27),
 ('tour st', 27),
 ('luxury transfer', 27),
 ('orsay museum', 27),
 ('cdg airport', 27),
 ('versaille palace', 27),
 ('big boat', 27),
 ('excursion rome', 27),
 ('chapel private', 27),
 ('tour messina', 27),
 ('sekumpul waterfall', 26),
 ('experience bali', 26),
 ('besakih temple', 26),
 ('balinese cooking', 26),
 ('cepung waterfall', 26),
 ('tirta gangga', 26),
 ('horse ride', 26),
 ('bali waterfall', 26),
 ('tour english', 26),
 ('luxury yacht', 26),
 ('group private', 26),
 ('walk private', 26),
 ('tour morning', 26),
 ('barcelona highlight', 26),
 ('airport bcn', 26),
 ('private wine', 26),
 ('city business', 26),
 ('city chania', 26),
 ('smallgroup day', 26),
 ('bosphorus dinner', 26),
 ('cappadocia day', 26),
 ('airport istanbul', 26),
 ('london eye', 26),
 ('pauls cathedral', 26),
 ('black taxi', 26),
 ('royal london', 26),
 ('city paris', 26),
 ('sea canoe', 26),
 ('golf cart', 26),
 ('rome small', 26),
 ('vatican museums', 26),
 ('tour civitavecchia', 26),
 ('group pasta', 26),
 ('colosseum guide', 26),
 ('colosseum private', 26),
 ('palermo monreale', 26),
 ('snorkeling tour', 25),
 ('private allinclusive', 25),
 ('tour halfday', 25),
 ('raft ubud', 25),
 ('tour enjoy', 25),
 ('private family', 25),
 ('beach private', 25),
 ('tirta empul', 25),
 ('hide gem', 25),
 ('private custom', 25),
 ('pickup private', 25),
 ('tour fast', 25),
 ('island private', 25),
 ('transfer fromto', 25),
 ('island speedboat', 25),
 ('history tour', 25),
 ('guide walking', 25),
 ('escape game', 25),
 ('samaria gorge', 25),
 ('sandboarde camel', 25),
 ('world adventure', 25),
 ('cruise tour', 25),
 ('marina dhow', 25),
 ('warner bros', 25),
 ('tour cappadocia', 25),
 ('ephesus cappadocia', 25),
 ('airport see', 25),
 ('churchill war', 25),
 ('war room', 25),
 ('tour palma', 25),
 ('transfer palma', 25),
 ('moulin rouge', 25),
 ('airport ory', 25),
 ('boat phuket', 25),
 ('vatican colosseum', 25),
 ('rome local', 25),
 ('civitavecchia rome', 25),
 ('ubud jungle', 24),
 ('waja river', 24),
 ('service private', 24),
 ('tour car', 24),
 ('tour kintamani', 24),
 ('ubud tanah', 24),
 ('art tour', 24),
 ('driver bali', 24),
 ('art village', 24),
 ('spa package', 24),
 ('professional guide', 24),
 ('island snorkeling', 24),
 ('seafood dinner', 24),
 ('private excursion', 24),
 ('airport departure', 24),
 ('virtual reality', 24),
 ('local market', 24),
 ('day excursion', 24),
 ('professional photographer', 24),
 ('park güell', 24),
 ('secret food', 24),
 ('train station', 24),
 ('center private', 24),
 ('tour family', 24),
 ('independent day', 24),
 ('dubai burj', 24),
 ('img world', 24),
 ('bash camel', 24),
 ('safari sand', 24),
 ('private vip', 24),
 ('pizza make', 24),
 ('vip tour', 24),
 ('good istanbul', 24),
 ('cruise istanbul', 24),
 ('fromto istanbul', 24),
 ('dolmabahce palace', 24),
 ('center hotel', 24),
 ('castle stonehenge', 24),
 ('london hotel', 24),
 ('day london', 24),
 ('phuket private', 24),
 ('class rome', 24),
 ('private colosseum', 24),
 ('share pasta', 24),
 ('pasta love', 24),
 ('love small', 24),
 ('etna tour', 24),
 ('temple waterfall', 23),
 ('heaven bali', 23),
 ('trek natural', 23),
 ('swing ubud', 23),
 ('tour gate', 23),
 ('waterfall private', 23),
 ('sunrise trekking', 23),
 ('tour lempuyang', 23),
 ('tour airport', 23),
 ('day day', 23),
 ('private shuttle', 23),
 ('tour evening', 23),
 ('pickup dropoff', 23),
 ('tour people', 23),
 ('premium tour', 23),
 ('hill tour', 23),
 ('excursion private', 23),
 ('tour tower', 23),
 ('hopoff bus', 23),
 ('airport barcelona', 23),
 ('electric scooter', 23),
 ('day city', 23),
 ('private roundtrip', 23),
 ('private hour', 23),
 ('group guide', 23),
 ('chania airport', 23),
 ('private yacht', 23),
 ('lose chamber', 23),
 ('desert dune', 23),
 ('private desert', 23),
 ('camel riding', 23),
 ('private way', 23),
 ('north goa', 23),
 ('istanbul new', 23),
 ('change guard', 23),
 ('tour workshop', 23),
 ('paris louvre', 23),
 ('day paris', 23),
 ('musée dorsay', 23),
 ('rome food', 23),
 ('peter basilica', 23),
 ('tour pompeii', 23),
 ('san gimignano', 23),
 ('airport palermo', 23),
 ('bali snorkeling', 22),
 ('tour jungle', 22),
 ('tour dinner', 22),
 ('tour inclusive', 22),
 ('explore ubud', 22),
 ('class private', 22),
 ('lot uluwatu', 22),
 ('tour big', 22),
 ('bali sunrise', 22),
 ('discovery tour', 22),
 ('park ticket', 22),
 ('river tubing', 22),
 ('package tour', 22),
 ('snorkel tour', 22),
 ('park private', 22),
 ('luxury tour', 22),
 ('tour goa', 22),
 ('priority access', 22),
 ('pickup barcelona', 22),
 ('group walk', 22),
 ('wine cheese', 22),
 ('park guell', 22),
 ('hop hop', 22),
 ('barcelona local', 22),
 ('city walk', 22),
 ('line tour', 22),
 ('food market', 22),
 ('yacht cruise', 22),
 ('safari red', 22),
 ('dubai sightseeing', 22),
 ('safari dinner', 22),
 ('park dubai', 22),
 ('lunch dubai', 22),
 ('way transfer', 22),
 ('wadi water', 22),
 ('guide istanbul', 22),
 ('line guide', 22),
 ('day ephesus', 22),
 ('turkey day', 22),
 ('semiprivate max', 22),
 ('airport lcy', 22),
 ('semiprivate tour', 22),
 ('paris eiffel', 22),
 ('tour montmartre', 22),
 ('paris walk', 22),
 ('phuket phi', 22),
 ('fly hanuman', 22),
 ('civita di', 22),
 ('airport cia', 22),
 ('piazza armerina', 22),
 ('ubud monkey', 21),
 ('private fishing', 21),
 ('traditional village', 21),
 ('lunch day', 21),
 ('mother temple', 21),
 ('cooking experience', 21),
 ('good waterfall', 21),
 ('car driver', 21),
 ('denpasar city', 21),
 ('rai airport', 21),
 ('ubud art', 21),
 ('beach day', 21),
 ('vacation photographer', 21),
 ('private boat', 21),
 ('van private', 21),
 ('afternoon tea', 21),
 ('sailing experience', 21),
 ('barcelona montserrat', 21),
 ('city barcelona', 21),
 ('quarter private', 21),
 ('cable car', 21),
 ('barcelona day', 21),
 ('lunch barcelona', 21),
 ('experience local', 21),
 ('skiptheline private', 21),
 ('tour taste', 21),
 ('private morning', 21),
 ('food experience', 21),
 ('palace tour', 21),
 ('ski dubai', 21),
 ('transfer abu', 21),
 ('dubai dinner', 21),
 ('theme park', 21),
 ('prince island', 21),
 ('istanbul include', 21),
 ('istanbul sabiha', 21),
 ('max people', 21),
 ('jack ripper', 21),
 ('buckingham palace', 21),
 ('port london', 21),
 ('tour windsor', 21),
 ('paris versaille', 21),
 ('notre dame', 21),
 ('trip phuket', 21),
 ('rome guide', 21),
 ('colosseum express', 21),
 ('rome pompeii', 21),
 ('airport civitavecchia', 21),
 ('etna wine', 21),
 ('car hire', 20),
 ('bali water', 20),
 ('package private', 20),
 ('heaven gate', 20),
 ...]

The bi-grams were able to pick out important terms, such as 'windsor castle' for London, and 'cooking class' for Sicily. However, words like 'tour' are creating some noise in most of these cities.

Removing Noise from the Data¶

Since there are still lots of words in the word clouds like 'private', 'airport' and 'transfer', I want to try to take those attractions for airport transfers out because they are causing noise in the data.

In [41]:
df.head()
Out[41]:
Attraction City cleaned lemmatized
0 SEA LIFE London Aquarium Admission Ticket London, United Kingdom sea life london aquarium admission ticket sea life london aquarium admission ticket
1 The Jack The Ripper Walking Tour in London London, United Kingdom the jack the ripper walking tour in london jack ripper walking tour london
2 Ghost Bus Tour of London London, United Kingdom ghost bus tour of london ghost bus tour london
3 Big Bus London Hop-On Hop-Off Tour and River C... London, United Kingdom big bus london hopon hopoff tour and river cru... big bus london hopon hopoff tour river cruise ...
4 The Blood and Tears Walk: Serial Killers and L... London, United Kingdom the blood and tears walk serial killers and lo... blood tear walk serial killer london horror
In [42]:
# Preview what I want to drop
df.loc[df['Attraction'].str.contains('airport')]
Out[42]:
Attraction City cleaned lemmatized
329 Private transfer from Heathrow airport to Sout... London, United Kingdom private transfer from heathrow airport to sout... private transfer heathrow airport southampton ...
639 Private transfer from city airport to central ... London, United Kingdom private transfer from city airport to central ... private transfer city airport central london
640 private transfer from central london to city a... London, United Kingdom private transfer from central london to city a... private transfer central london city airport
1549 Private airport transfers in London London, United Kingdom private airport transfers in london private airport transfer london
1830 London airport transfer from Heathrow Airport ... London, United Kingdom london airport transfer from heathrow airport ... london airport transfer heathrow airport lhr l...
... ... ... ... ...
1407 Private 4-hour tour of Dubai from hotel, airpo... Dubai, United Arab Emirates private tour of dubai from hotel airport or c... private tour dubai hotel airport cruise loca...
2398 Dubai airport terminal 1,2 or 3 to Ras Al Khaimah Dubai, United Arab Emirates dubai airport terminal or to ras al khaimah dubai airport terminal ras al khaimah
2407 Dubai airport terminal 1,2 or 3 to Ajman Dubai, United Arab Emirates dubai airport terminal or to ajman dubai airport terminal ajman
2408 Dubai airport terminal 1,2 or 3 to Sharjah city Dubai, United Arab Emirates dubai airport terminal or to sharjah city dubai airport terminal sharjah city
3168 Dubai city tour Stop Over pick up from airport... Dubai, United Arab Emirates dubai city tour stop over pick up from airport... dubai city tour stop pick airport morning tour

314 rows × 4 columns

In [43]:
# Get rid of the airport transfer 'attractions'
df2 = df.drop(df.loc[df['Attraction'].str.contains('airport')].index)
In [44]:
df2 = df.drop(df2.loc[df2['Attraction'].str.contains('transfer')].index)
In [45]:
# Just in case, add these words to the stopwords list
stopwords_list += ['airport', 'transfer', 'private']
In [46]:
print(df.shape)
print(df2.shape)
(27533, 4)
(25315, 4)

Create some functions to make the preprocessing steps easier

In [47]:
def preprocess_df(df, column, preview=True, lemmatize=True):
    """
    Input df with raw text attractions.
    Return df with preprocessed text.
    If preview=True, returns a preview of the new df.
    """
    
    df[column] = df['Attraction'].apply(lambda x: x.lower())
    df[column] = df[column].apply(lambda x: re.sub('[%s]' % re.escape(string.punctuation), '', x))
    df[column] = df[column].apply(lambda x: re.sub('\w*\d\w*','', x))
    
    if lemmatize:
        df[column] = df[column].apply(lambda x: ' '.join(
                                        [token.lemma_ for token in list(nlp(x)) if (token.is_stop==False)]))
    if preview:
        display(df.head(10))
        
    return df
In [48]:
def group_text_per_city(df, column):
    """
    Groups the preprocessed text per city.
    """
    df_to_group = df[['City', column]]
    df_grouped = df_to_group.groupby(by='City').agg(lambda x:' '.join(x))
    return df_grouped
In [49]:
def create_doc_term_matrix(df, column, count_vec=True, ngram_range=(1,1)):
    """
    Creates a document term matrix.
    Defaults to count vectorizer with optional n-gram param.
    If count_vec==False, uses a TF-IDF vectorizer.
    """
    df_grouped = group_text_per_city(df, column)
    
    if count_vec:
        vec = CountVectorizer(analyzer='word', stop_words=stopwords_list, ngram_range=ngram_range)
    else:
        vec = TfidfVectorizer(analyzer='word', stop_words=stopwords_list)
    
    data = vec.fit_transform(df_grouped[column])
    df_dtm = pd.DataFrame(data.toarray(), columns=vec.get_feature_names_out())
    df_dtm.index = df_grouped.index
    return df_dtm.transpose()
In [50]:
preprocessed_df = preprocess_df(df2, 'lemmatized')
dtm_cv = create_doc_term_matrix(preprocessed_df, 'lemmatized', count_vec=True)

for index, city in enumerate(dtm_cv.columns):
    generate_wordcloud(dtm_cv[city].sort_values(ascending=False), city)
Attraction City cleaned lemmatized
0 SEA LIFE London Aquarium Admission Ticket London, United Kingdom sea life london aquarium admission ticket sea life london aquarium admission ticket
1 The Jack The Ripper Walking Tour in London London, United Kingdom the jack the ripper walking tour in london jack ripper walking tour london
2 Ghost Bus Tour of London London, United Kingdom ghost bus tour of london ghost bus tour london
3 Big Bus London Hop-On Hop-Off Tour and River C... London, United Kingdom big bus london hopon hopoff tour and river cru... big bus london hopon hopoff tour river cruise ...
5 London Ghost and Infamous Murders Walking Tour London, United Kingdom london ghost and infamous murders walking tour london ghost infamous murder walk tour
6 Stonehenge, Windsor Castle, and Bath from London London, United Kingdom stonehenge windsor castle and bath from london stonehenge windsor castle bath london
7 Warner Bros. Studio: The Making of Harry Potte... London, United Kingdom warner bros studio the making of harry potter ... warner bros studio making harry potter luxury ...
8 Ghosts, Ghouls & Gallows: London Virtual Tour London, United Kingdom ghosts ghouls gallows london virtual tour ghosts ghoul gallow london virtual tour
9 High-Speed Thames River RIB Cruise in London London, United Kingdom highspeed thames river rib cruise in london highspeed thames river rib cruise london
10 Alcotraz: Prison Cocktail Experience (Shoreditch) London, United Kingdom alcotraz prison cocktail experience shoreditch alcotraz prison cocktail experience shoreditch
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [51]:
dtm_tfidf = create_doc_term_matrix(df2, 'lemmatized', count_vec=False)

for index, city in enumerate(dtm_tfidf.columns):
    generate_wordcloud(dtm_tfidf[city].sort_values(ascending=False), city)
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [52]:
dtm_bigram = create_doc_term_matrix(df2, 'lemmatized',
                                    count_vec=True, ngram_range=(2,2))

for index, city in enumerate(dtm_bigram.columns):
    generate_wordcloud(dtm_bigram[city].sort_values(ascending=False), city)
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Much better! I removed some of the noise terms out of the top words.

Make Nicer Word Clouds¶

In [53]:
def generate_better_wordcloud(data, title, mask=None):
    cloud = WordCloud(scale=3, max_words=150, colormap='tab20c', mask=mask,
                      background_color='white').generate_from_frequencies(data)
    plt.figure(figsize=(10,8))
    plt.imshow(cloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('\n'.join(wrap(title,60)), fontsize=13)
    plt.show()
In [54]:
mask = np.array(Image.open('../Images/palm2.jpg'))
In [55]:
generate_better_wordcloud(dtm_tfidf['Bali, Indonesia'].sort_values(ascending=False),
                          'Bali, Indonesia', mask=mask)
No description has been provided for this image
In [56]:
mask_london = np.array(Image.open('../Images/ben.jpg'))
In [57]:
generate_better_wordcloud(dtm_tfidf['London, United Kingdom'].sort_values(ascending=False),
                          'London, United Kingdom', mask=mask_london)
No description has been provided for this image
In [58]:
mask_paris = np.array(Image.open('../Images/paris.jpg'))
generate_better_wordcloud(dtm_tfidf['Paris, France'].sort_values(ascending=False),
                          'Paris, France', mask=mask_paris)
No description has been provided for this image
In [59]:
mask_rome = np.array(Image.open('../Images/italy.jpg'))
generate_better_wordcloud(dtm_tfidf['Rome, Italy'].sort_values(ascending=False),
                          'Rome, Italy', mask=mask_rome)
No description has been provided for this image
In [60]:
mask_sicily = np.array(Image.open('../Images/sicily.jpg'))
generate_better_wordcloud(dtm_tfidf['Sicily, Italy'].sort_values(ascending=False),
                          'Sicily, Italy', mask=mask_sicily)
No description has been provided for this image
In [61]:
mask_barca = np.array(Image.open('../Images/spain_flag.jpg'))
generate_better_wordcloud(dtm_tfidf['Barcelona, Spain'].sort_values(ascending=False),
                          'Barcelona, Spain', mask=mask_barca)
No description has been provided for this image
In [62]:
mask_dubai = np.array(Image.open('../Images/dubai.jpg'))
generate_better_wordcloud(dtm_tfidf['Dubai, United Arab Emirates'].sort_values(ascending=False),
                          'Dubai, United Arab Emirates', mask=mask_dubai)
No description has been provided for this image
In [63]:
generate_better_wordcloud(dtm_tfidf['Goa, India'].sort_values(ascending=False),
                          'Goa, India', mask=mask)
No description has been provided for this image
In [64]:
mask_istanbul = np.array(Image.open('../Images/hagia_sophia.jpg'))
generate_better_wordcloud(dtm_tfidf['Istanbul, Turkey'].sort_values(ascending=False),
                          'Istanbul, Turkey', mask=mask_istanbul)
No description has been provided for this image
In [65]:
generate_better_wordcloud(dtm_tfidf['Phuket, Thailand'].sort_values(ascending=False),
                          'Phuket, Thailand', mask=mask)
No description has been provided for this image
In [66]:
generate_better_wordcloud(dtm_tfidf['Majorca, Balearic Islands'].sort_values(ascending=False),
                          'Majorca, Balearic Islands', mask=mask)
No description has been provided for this image
In [67]:
generate_better_wordcloud(dtm_tfidf['Crete, Greece'].sort_values(ascending=False),
                          'Crete, Greece', mask=mask)
No description has been provided for this image

Most Frequent Words Visualizations¶

  • Find the most frequent words per city and visualize them
In [68]:
# Group the corpora by city and join them
df_to_group = preprocessed_df[['City', 'lemmatized']]
df_grouped = df_to_group.groupby(by='City').agg(lambda x:' '.join(x))
df_grouped
Out[68]:
lemmatized
City
Bali, Indonesia hotel hotelbali private transfer daytime bali ...
Barcelona, Spain interactive spanish cooking experience barcelo...
Crete, Greece minoans world museum cinema crete wine ol...
Dubai, United Arab Emirates premium red dune camel safari bbq al khayma ...
Goa, India fontainhas heritage walk sunset cruise paradis...
Istanbul, Turkey bosphorus sunset cruise luxury yacht istanbu...
London, United Kingdom sea life london aquarium admission ticket jack...
Majorca, Balearic Islands cave genova admission palma de mallorca shore ...
Paris, France bateaux parisien seine river gourmet dinner ...
Phuket, Thailand phi phi maiton khai island speedboat phi phi...
Rome, Italy fast skiptheline vatican sistine chapel st pet...
Sicily, Italy etna taormina fullday tour catania palermo str...
In [69]:
bali_text = df_grouped.loc['Bali, Indonesia', 'lemmatized']
fd = FreqDist(word_tokenize(bali_text))
fd.most_common(20)
Out[69]:
[('bali', 2206),
 ('tour', 2190),
 ('private', 1025),
 ('ubud', 869),
 ('day', 669),
 ('temple', 512),
 ('waterfall', 402),
 ('batur', 292),
 ('transfer', 276),
 ('good', 273),
 ('water', 262),
 ('raft', 259),
 ('adventure', 254),
 ('airport', 253),
 ('nusa', 252),
 ('lot', 251),
 ('tanah', 242),
 ('fullday', 231),
 ('package', 226),
 ('sunrise', 225)]
In [70]:
city_freqs = {}
for city in df_grouped.index:
    city_text = df_grouped.loc[city, 'lemmatized']
    fd = FreqDist(word_tokenize(city_text))
    city_freqs[city] = fd.most_common(20)
city_freqs_df = pd.DataFrame(city_freqs)
city_freqs_df.head()
Out[70]:
Bali, Indonesia Barcelona, Spain Crete, Greece Dubai, United Arab Emirates Goa, India Istanbul, Turkey London, United Kingdom Majorca, Balearic Islands Paris, France Phuket, Thailand Rome, Italy Sicily, Italy
0 (bali, 2206) (barcelona, 1105) (private, 340) (dubai, 2144) (goa, 221) (istanbul, 1340) (london, 1860) (mallorca, 209) (paris, 1653) (phuket, 586) (tour, 2616) (tour, 759)
1 (tour, 2190) (tour, 861) (tour, 272) (tour, 1321) (tour, 150) (tour, 1243) (tour, 1463) (tour, 144) (tour, 1152) (tour, 398) (rome, 2611) (private, 554)
2 (private, 1025) (private, 616) (transfer, 259) (desert, 900) (private, 45) (day, 668) (private, 1201) (palma, 134) (private, 949) (island, 316) (private, 1630) (palermo, 422)
3 (ubud, 869) (transfer, 172) (airport, 251) (safari, 898) (day, 39) (private, 578) (airport, 439) (private, 97) (transfer, 297) (phi, 310) (colosseum, 656) (transfer, 317)
4 (day, 669) (airport, 152) (crete, 216) (private, 708) (guide, 34) (airport, 301) (transfer, 427) (de, 90) (airport, 295) (day, 161) (vatican, 649) (etna, 312)
In [75]:
# Make one graph of most frequent words
yaxis = [x[0] for x in city_freqs_df['Bali, Indonesia']]
xaxis = [x[1] for x in city_freqs_df['Bali, Indonesia']]

plt.figure(figsize=(10,8))
sns.barplot(xaxis)
plt.title('Most Frequent Words: Bali, Indonesia')
plt.xlabel('Frequency')
plt.ylabel('Word')
plt.show()
No description has been provided for this image
In [77]:
# Make graphs for each city
for city in city_freqs_df.columns:    
    yaxis = [x[0] for x in city_freqs_df[city]]
    xaxis = [x[1] for x in city_freqs_df[city]]

    plt.figure(figsize=(10,8))
    sns.barplot(xaxis)
    plt.title(f'Most Frequent Words: {city}')
    plt.xlabel('Frequency')
    plt.ylabel('Word')
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Modeling¶

Baseline Naive Bayes Model¶

In [ ]:
# Re-import the data to get a fresh start
data = pd.read_csv('../Data/cities_df', index_col=0)
data.head()
In [ ]:
# Perform train/test split before cleaning/preprocessing
X = data['Attraction']
y = data['City']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=123)
X_train.shape, X_test.shape
In [ ]:
# Since this is a series, I will need to make it a DF for my preprocessing function
X_train
In [ ]:
X_train_preprocessed = preprocess_df(pd.DataFrame(X_train, columns=['Attraction']), 'lemmatized')
X_test_preprocessed = preprocess_df(pd.DataFrame(X_test, columns=['Attraction']), 'lemmatized')
In [ ]:
stopwords_list = stopwords.words('english')
stopwords_list += list(string.punctuation)
stopwords_list += ['airport', 'transfer', 'private']
In [ ]:
# Vectorize the text data to be suitable for modeling
vectorizer = TfidfVectorizer(analyzer='word', stop_words=stopwords_list, decode_error='ignore')
X_train_tfidf = vectorizer.fit_transform(X_train_preprocessed['lemmatized'])
X_test_tfidf = vectorizer.transform(X_test_preprocessed['lemmatized'])
In [ ]:
def plot_conf_matrix(y_true, y_pred):
    
    """
    Plots a confusion matrix and displays classification report.
    """
    
    cm = confusion_matrix(y_true, y_pred, normalize='true')
    plt.figure(figsize=(15, 15))
    sns.heatmap(cm, annot=True, cmap='Blues', fmt='0.2g', annot_kws={"size": 14},
                xticklabels=nb.classes_, yticklabels=nb.classes_, square=True)
    plt.xlabel('Predictions')
    plt.ylabel('Actuals')
    plt.show()
In [ ]:
def evaluate_model(model, X_train, X_test):
    y_preds_train = model.predict(X_train.todense())
    y_preds_test = model.predict(X_test.todense())

    print('Training Accuracy:', accuracy_score(y_train, y_preds_train))
    print('Testing Accuracy:', accuracy_score(y_test, y_preds_test))
    print('\n---------------\n')
    print('Training F1:', f1_score(y_train, y_preds_train, average='weighted'))
    print('Testing F1:', f1_score(y_test, y_preds_test, average='weighted'))
    print('\n---------------\n')
    print('Train Confusion Matrix\n')
    plot_conf_matrix(y_train, y_preds_train)
    print('Test Confusion Matrix\n')
    plot_conf_matrix(y_test, y_preds_test)
    print('\n----------------\n')
    print(classification_report(y_test, y_preds_test))
In [ ]:
nb = MultinomialNB()
nb.fit(X_train_tfidf.todense(), y_train)
In [ ]:
nb.classes_
In [ ]:
evaluate_model(nb, X_train_tfidf, X_test_tfidf)

Surprisingly, this model performs pretty well. However, the 3 classes with the lowest accuracy and F1 scores are Goa, Majorca, and Crete. These are also the 3 classes with the least attractions, meaning that class imbalance is definitely affecting this model. I can fix this issue using class weights in the next iteration.

Naive Bayes Iteration 2¶

  • Using class weights to improve class imbalance.
In [ ]:
# Compute class weights
class_weights = class_weight.compute_class_weight('balanced',
                                                  np.unique(y_train),
                                                  y_train)
weights_dict = dict(zip(np.unique(y_train), class_weights))
weights_dict
In [ ]:
# Use class weights dictionary to calculate sample weight (needed for MultinomialNB)
sample_weights = class_weight.compute_sample_weight(weights_dict, y_train)
In [ ]:
nb = MultinomialNB()
nb.fit(X_train_tfidf.todense(),
       y_train,
       sample_weight=sample_weights)
In [ ]:
evaluate_model(nb, X_train_tfidf, X_test_tfidf)

This model did really well! Although, in many of these cities' attractions text, the name of the city is included. This may become an issue in the future because when we introduce new text to this model, it may not include the city name.

Iteration 3: What happens if I take the city names out?¶

In [ ]:
new_stopwords = stopwords_list + ['bali', 'barcelona', 'crete', 'dubai',
                                  'istanbul', 'london', 'majorca', 'phuket',
                                  'paris', 'rome', 'sicily', 'mallorca', 'goa']
In [ ]:
vectorizer = TfidfVectorizer(analyzer='word',
                             stop_words=new_stopwords,
                             decode_error='ignore')
X_train_tfidf = vectorizer.fit_transform(X_train_preprocessed['lemmatized'])
X_test_tfidf = vectorizer.transform(X_test_preprocessed['lemmatized'])
In [ ]:
nb = MultinomialNB()
nb.fit(X_train_tfidf.todense(),
       y_train,
       sample_weight=sample_weights)
In [ ]:
evaluate_model(nb, X_train_tfidf, X_test_tfidf)
In [ ]:
# Save the Naive Bayes Model
joblib.dump(nb, '../nb_model')

Much better, because these are more realistic accuracy scores and F1 scores for when we introduce new text to the model.

Iteration 4: Try using Count Vectorization¶

In [ ]:
# Continuing each new iteration without city names 
cv = CountVectorizer(analyzer='word',
                     stop_words=new_stopwords,
                     decode_error='ignore')
X_train_cv = cv.fit_transform(X_train_preprocessed['lemmatized'])
X_test_cv = cv.transform(X_test_preprocessed['lemmatized'])
nb_cv = MultinomialNB()
nb_cv.fit(X_train_cv.todense(),
          y_train,
          sample_weight=sample_weights)
evaluate_model(nb_cv, X_train_cv, X_test_cv)

With count vectorization, the scores are very similar, but still a tiny bit lower than with TF-IDF vectorization, therefore I will keep the TF-IDF vectorization strategy.

Iteration 5: Try using Bi-Grams¶

In [ ]:
bigram = CountVectorizer(analyzer='word',
                         stop_words=new_stopwords,
                         decode_error='ignore',
                         ngram_range=(2,2))
X_train_bg = bigram.fit_transform(X_train_preprocessed['lemmatized'])
X_test_bg = bigram.transform(X_test_preprocessed['lemmatized'])
nb_bg = MultinomialNB()
nb_bg.fit(X_train_bg.todense(),
          y_train,
          sample_weight=sample_weights)
evaluate_model(nb_bg, X_train_bg, X_test_bg)

The bi-grams did well for the training accuracy, but not so great for the testing accuracy. Thus, this model is very overfit, and TF-IDF vectorization is the best vectorization strategy for this dataset.

Iteration 6: Try using a Random Forest Model¶

  • The benefit of this is the ability to see feature importances and get more insight into how the model is working with the text data
In [ ]:
vectorizer = TfidfVectorizer(analyzer='word',
                             stop_words=new_stopwords,
                             decode_error='ignore')
X_train_tfidf = vectorizer.fit_transform(X_train_preprocessed['lemmatized'])
X_test_tfidf = vectorizer.transform(X_test_preprocessed['lemmatized'])
In [ ]:
rf = RandomForestClassifier(class_weight=weights_dict)
rf.fit(X_train_tfidf.todense(), y_train)
In [ ]:
evaluate_model(rf, X_train_tfidf, X_test_tfidf)
In [ ]:
#Get feature importances
feat_imps = pd.Series(rf.feature_importances_,
                      index=vectorizer.get_feature_names_out())
feat_imps[:11]
In [ ]:
top_20_feats = feat_imps.sort_values(ascending=False).head(20)
top_20_feats
In [ ]:
plt.figure(figsize=(10,8))
sns.barplot(x=top_20_feats, y=top_20_feats.index)
plt.title('Top 20 Features')
plt.ylabel('Word')
plt.xlabel('Importance')
plt.show()

This model is also overfit, even though it still performs very well with the test set. Interestingly, the feature importances show a lot of city-specific words, such as 'etna'-- the name of a volcano in Sicily. In the future, it might be a good idea to take these kinds of words out, but for the model's use-case we can leave them in for now.

This model's performance is very good, but random forests have 2 major flaws that will affect this model for its specific use-case:

  1. They are more computationally expensive than Naive Bayes models (AKA they take longer to train and predict)
  2. They use a greedy algorithm, meaning they often favor the bigger class (in this case it would predict Bali much more often than any of the other beach destinations)

For these reasons I still think iteration 3 is the best model so far.

Try out iteration 3 without lemmatization¶

  • One last thing I would like to try is using the cleaned text data without lemmatizing it. I created the preprocessing function to give me this option.
In [ ]:
X_train_cleaned = preprocess_df(pd.DataFrame(X_train, columns=['Attraction']),
                                'cleaned', lemmatize=False)
X_test_cleaned = preprocess_df(pd.DataFrame(X_test, columns=['Attraction']),
                               'cleaned', lemmatize=False)
In [ ]:
vectorizer = TfidfVectorizer(analyzer='word',
                             stop_words=new_stopwords,
                             decode_error='ignore')
X_train_tfidf = vectorizer.fit_transform(X_train_cleaned['cleaned'])
X_test_tfidf = vectorizer.transform(X_test_cleaned['cleaned'])
In [ ]:
nb_cleaned = MultinomialNB()
nb_cleaned.fit(X_train_tfidf.todense(),
       y_train,
       sample_weight=sample_weights)
In [ ]:
evaluate_model(nb_cleaned, X_train_tfidf, X_test_tfidf)

Ultimately, this model performed VERY similar to the lemmatized version. There are a couple of small differences that made me choose this version as my final model:

  1. The test accuracy and F1 scores are a tiny bit higher for this model compared to the lemmatized version. Even though it is only one percent higher overall, the breakdown under each city has increased some of the smaller classes, such as Goa and Phuket.

  2. Lemmatization is more computationally expensive than omitting the lemmatization. It is a small difference, but should still be a consideration.

Therefore, my final model is iteration 3 (Naive Bayes) without lemmatizing the text.

In [ ]:
# Save the best Naive Bayes Model
joblib.dump(nb_cleaned, '../non_lemmatized_model')

Test out the model¶

  • I will ultimately use this model to tell people where they should travel based on what they want to do while on vacation. Let's look at some of the sample predictions this model would give them, using iteration 3 as our final model.
In [ ]:
def preprocess_text(text):
    """
    Input raw text.
    Return preprocessed text.
    """
    
    preprocessed = text.lower()
    preprocessed = re.sub('[%s]' % re.escape(string.punctuation), '', preprocessed)
    preprocessed = re.sub('\w*\d\w*','', preprocessed)
        
    return [preprocessed]
In [ ]:
raw_text = 'I want to go to the beach, go hiking and snorkeling'
preprocessed_text = preprocess_text(raw_text)
preprocessed_text
In [ ]:
nb_cleaned.predict(vectorizer.transform(preprocessed_text))
In [ ]:
preprocessed2 = preprocess_text('Go to historic museums')
print(preprocessed2)
nb_cleaned.predict(vectorizer.transform(preprocessed2))
In [ ]:
preprocessed3 = preprocess_text('Wine tastings, long walks and dinners')
print(preprocessed3)
nb_cleaned.predict(vectorizer.transform(preprocessed3))
In [ ]:
preprocessed4 = preprocess_text('Do yoga on the beach')
print(preprocessed4)
nb_cleaned.predict(vectorizer.transform(preprocessed4))
In [ ]:
preprocessed5 = preprocess_text('Sunset cruises on a yacht with wine')
print(preprocessed5)
nb_cleaned.predict(vectorizer.transform(preprocessed5))

Make this process into a pipeline¶

In [ ]:
# Use OOP to get preprocessing steps into a pipeline
class PreprocessText(TransformerMixin):
    
    def __init__(self):
        self = self
    
    def fit(self, X, y=None, **fit_params):
        return self
        
    def transform(self, X, **transform_params):
        try:
            X = pd.DataFrame(X, columns=['Attraction'])
            X['cleaned'] = X['Attraction'].apply(lambda x: x.lower())
            X['cleaned'] = X['cleaned'].apply(lambda x: re.sub('[%s]' % re.escape(string.punctuation), '', x))
            X['cleaned'] = X['cleaned'].apply(lambda x: re.sub('\w*\d\w*','', x))
            
            X = X['cleaned']
        except:
            pass
        return X
    
class DenseTransformer():

    def __init__(self):
        self = self
    
    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        return X.todense()
In [ ]:
# Test preprocessing class
prep = PreprocessText()
prep.transform(X_train)
In [ ]:
pipe = Pipeline(steps=[
                ('TextPreprocessor', PreprocessText()),
                ('TFIDFVectorizer', TfidfVectorizer(analyzer='word',
                                                    stop_words=new_stopwords,
                                                    decode_error='ignore')),
                ('DenseTransformer', DenseTransformer()),
                ('NaiveBayes', MultinomialNB())])
In [ ]:
set_config(display='diagram')
In [ ]:
pipe.fit(X_train,
         y_train, 
         **{'NaiveBayes__sample_weight': sample_weights})
In [ ]:
pipe.score(X_test, y_test)
In [ ]:
pipe.predict(['I want to go snorkeling and tan on the beach'])
In [ ]:
pipe.predict(['Go out for drinks'])
In [ ]:
def evaluate_pipe(pipe, X_train, X_test):
    y_preds_train = pipe.predict(X_train)
    y_preds_test = pipe.predict(X_test)

    print('Training Accuracy:', accuracy_score(y_train, y_preds_train))
    print('Testing Accuracy:', accuracy_score(y_test, y_preds_test))
    print('\n---------------\n')
    print('Training F1:', f1_score(y_train, y_preds_train, average='weighted'))
    print('Testing F1:', f1_score(y_test, y_preds_test, average='weighted'))
    print('\n---------------\n')
    print('Train Confusion Matrix\n')
    plot_conf_matrix(y_train, y_preds_train)
    print('Test Confusion Matrix\n')
    plot_conf_matrix(y_test, y_preds_test)
    print('\n----------------\n')
    print(classification_report(y_test, y_preds_test))
In [ ]:
evaluate_pipe(pipe, X_train, X_test)
In [ ]:
# Save the Pipeline
joblib.dump(pipe, '../final_pipeline')
In [ ]:
# Load pipe from joblib file to test
best_model_pipe = joblib.load('../final_pipeline')
best_model_pipe
In [ ]:
best_model_pipe.predict(['I want to visit art galleries'])

Make Pipeline and Gridsearch for Random Forest¶

Now that I have a pipeline created, I will go back to iteration 6 with model tuning so see if I can get a better score with a less overfit model. I will use a gridsearch to tune the n_estimators and max_depth parameters, which commonly cause overfitting in Random Forest models.

In [ ]:
rf_pipe = Pipeline(steps=[
                ('TextPreprocessor', PreprocessText()),
                ('TFIDFVectorizer', TfidfVectorizer(analyzer='word',
                                                    stop_words=new_stopwords,
                                                    decode_error='ignore')),
                ('DenseTransformer', DenseTransformer()),
                ('RandomForest', RandomForestClassifier(class_weight=weights_dict))])
In [ ]:
# Use a grid search to do some model tuning with the Random Forest (iter. 6)
param_grid = {'RandomForest__n_estimators': [100, 250, 500, 750],
              'RandomForest__max_depth': [5, 7, 9]}
In [ ]:
rf_gridsearch = GridSearchCV(rf_pipe, param_grid=param_grid,
                             verbose=1, scoring='accuracy')
rf_gridsearch.fit(X_train, y_train)
In [ ]:
# See the params from the best model
rf_gridsearch.best_params_
In [ ]:
# Save best estimator in a variable and get accuracy score
best_rf = rf_gridsearch.best_estimator_
y_test_preds = best_rf.predict(X_test)
accuracy_score(y_test, y_test_preds)
In [ ]:
# Compare against training accuracy
y_train_preds = best_rf.predict(X_train)
accuracy_score(y_train, y_train_preds)

Even with the model tuning, this model still is not performing as well as iteration 3. Although the model tuning helped to prevent overfitting, its accuracy score is not as good. Therefore, I will still conclude that the Naive Bayes iteration 3 without lemmatization is the best model.

Get top 2 predictions from best model¶

  • In the dash app, it would be good to give someone a second prediction just in case they have already been to the first place the model predicts.
In [ ]:
probas = best_model_pipe.predict_proba(['I want to visit art galleries'])
probas
In [ ]:
classes = best_model_pipe.classes_
classes
In [ ]:
# First Prediction
classes[probas.argmax()]
In [ ]:
# Second Prediction
classes[np.argsort(probas)[:, 10]][0]

Conclusion¶

My final model is a Multinomial Naive Bayes classifier, which can predict a destination with 81% accuracy and an 82% F1 score (iteration 3 without lemmatization in this notebook). The text data put into this model is not lemmatized, but is lowercased with stopwords removed and city names removed.

Model Fit & Score¶

I used accuracy and F1 score to score this model. Since there are 12 classes, I want to model to be accurate, however, F1 score is also important to consider since there is some class imbalance in the dataset and to account for the model's false positives and false negatives.

The final model had the following training and testing accuracy and F1 scores:

  • Testing Accuracy Score 0.81 | F1 Score 0.82
  • Training Accuracy Score 0.86 | F1 Score 0.86

Looking at the above scores for both accuracy and F1, we can conclude that the model is a tiny bit overfit, but overall very accurate, especially considering that there are 12 classes.

I was surprised that the final/best-performing model was iteration 3 without lemmatization because I thought that lemmatizing the text would help the model's score.

Business Recommendations¶

  • Integrate the Destination Dictionary technology into pages where Top Destination lists are published to drive engagement with future travelers and drive traffic to affiliate links

  • Use the Destination Dictionary technology paired with a chatbot on travel websites to act as a virtual travel agent

  • Offer paid sponsorship of the 'default' city-- ex. Tourism Board of Bali can pay be the first recommended city when you open the page

Next Steps -- Dash App¶

Everything works and is ready for the next step. This model will be put into The Destination Dictionary Dash app for its final use-case: predicting where people should travel based on the activities that they want to do on vacation!

You can see the GitHub Repo for the Dash App here